On the same ggplot figure, I am trying to have the points (from geom_point), the lines (from geom_line) and the errorbars (from geom_errorbar) on the same "plane" (i.e. not overlapping), this for each factor.
As you can see, the "layering" of the errorbars does not follow the "layering" of the lines (not to mention the points).
Here is a reproducible example:
# reproducible example
# package
library(dplyr)
library(ggplot2)
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
as.data.frame()
# plot
p1 <- ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2)
p1
p1a <- p1 + geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value), width = .2, position = position_dodge(0.3), size =1) +
geom_point(aes(), position = position_dodge(0.3), size = 3) +
geom_line(aes(color = treatment), position=position_dodge(0.3), size =1)
p1a
Any idea?
Any help would be greatly appreciated :)
Thanks a lot!
Valérian
Up front: this is a partial answer that has two notable issues still to fix (see the end). Edit: the two issues have been resolved, see the far bottom.
I'll change the "dodge" slightly to clarify the point, identify an area of concern, and demonstrate a suggested workaround.
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
dplyr::arrange(desc(treatment)) %>%
as.data.frame()
# plot
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2) +
geom_point(aes(), position = position_dodge(0.03), size = 3) +
geom_line(aes(color = treatment), position = position_dodge(0.03), size = 2)
Namely, I'll assume that we want HIGH (red) points/lines/error-bars as the top-most layer, masked by nothing. We can see a clear violation of this in the right-most bar: the red dot is over the green errorbar but under the green line.
Unless/until there is an aes(layer=..) aesthetic (there is not afaik), you need to add layers one treatment at a time. While one could hard-code this with nine geoms, you can automate this with lapply. Note that ggplot(.) + list(geom1,geom2,geom3) works just fine, even with nested lists.
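I'll control the order of layers with rev(levels(d2$treatment)): the levels are added in reverse order, so the last level is added first (bottom-most) and the first level, HIGH, is added last (top-most). With the default alphabetical levels (HIGH, LOW, MEDIUM) that puts MEDIUM at the bottom; reorder the factor levels first if you want LOW there. The order of geoms within the list is what defines their layers. Technically a single treatment's errorbar, point, and line are still on different layers, but they are consecutive, so they appear to be on the same one.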
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), position = position_dodge(0.03), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), position = position_dodge(0.03), size = 2)
)
})
(Side note: I use levels(d2$treatment) and data=~subset(., treatment==trtmnt) here, but that's just one way to do it. Another would be lapply(split(d2, d2$treatment), function(x) ...) and use data=x in all of the inner geoms. This latter method allows for multi-variable grouping, if desired. I see no immediate advantage to one over the other.)
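A minimal sketch of the split() variant from the side note, assuming the same simulated data as above. The name pieces is introduced here for illustration only, and line/errorbar widths are left at their defaults to keep it short.

```r
library(dplyr)
library(ggplot2)

# Same simulated data as in the question
set.seed(244)
d1 <- data.frame(
  time_serie = factor(rep(rep(1:3, each = 6), 3)),
  treatment  = factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
  value      = runif(54, 1, 10)
)
d2 <- d1 %>%
  group_by(time_serie, treatment) %>%
  summarise(mean_value = mean(value),
            SE_value = sd(value) / sqrt(n()),
            .groups = "drop") %>%
  as.data.frame()

# One data frame per treatment, in factor-level order; rev() makes the
# first level (HIGH) the last one added, i.e. the top-most layers.
pieces <- rev(split(d2, d2$treatment))

p <- ggplot(d2, aes(x = time_serie, y = mean_value,
                    color = treatment, group = treatment)) +
  lapply(pieces, function(x) {
    list(
      geom_errorbar(data = x,
                    aes(ymin = mean_value - SE_value,
                        ymax = mean_value + SE_value),
                    width = 0.2),
      geom_point(data = x, size = 3),
      geom_line(data = x)
    )
  })
p
```

Because each piece is an ordinary data frame, splitting on several variables at once (for example with interaction()) should work the same way.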
The problems with this:
1. The order of the legend is not consistent with the order of the factor levels; somehow that is lost. (To be clear, I don't demonstrate this very well here: I can move "medium" to the middle of the legend using levels<-, and that works with the non-lapply rendering code (which has the incorrect layering), but it is lost again with the lapply-geoms.)
2. position_dodge no longer has awareness of the other treatments, so it does not dodge the other errorbars. The only way around this is to dodge manually before plotting, as shown under "2: Dodging".
1: Order of legend elements
This one was solved in the question "lapply'd geoms lose factor-ordering": we just need to add scale_color_discrete(drop=FALSE).
2: Dodging
The dodge issue can be fixed by using real numerics in the x aesthetic. This is kind of a hack, as it is no longer done by ggplot2 but controlled externally. It's also applying an offset and not dodging, per se. But it does get the desired results.
d2$time_serie2 <- as.integer(as.character(d2$time_serie)) + as.numeric(d2$treatment)/10
ggplot(aes(x = time_serie2, y = mean_value, color = treatment, group = treatment), data = d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), size = 2)
)
}) +
scale_color_discrete(drop = FALSE)
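To make the offset arithmetic concrete, here is a small sketch on a hypothetical toy grid (grid, x_offset and x_centred are names made up for this illustration, not part of the original data). It also shows a centred variant, which is an assumption about what one might prefer: the groups then straddle the integer tick instead of all sitting to its right.

```r
# Hypothetical toy grid standing in for d2: 3 time points x 3 treatments
grid <- expand.grid(
  time_serie = factor(1:3),
  treatment  = factor(c("HIGH", "LOW", "MEDIUM"))
)

k <- nlevels(grid$treatment)
time_num <- as.integer(as.character(grid$time_serie))

# Offset as used above: every group is shifted to the right of the tick
grid$x_offset <- time_num + as.numeric(grid$treatment) / 10

# Centred variant: offsets are symmetric around the tick
grid$x_centred <- time_num + (as.numeric(grid$treatment) - (k + 1) / 2) / 10

grid[grid$time_serie == "1", ]
```

With the centred variant, scale_x_continuous(breaks = 1:3) should keep the axis ticks on the original time points.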
Reproduced from this code:
library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
nhanesAnalysis <- nhanesDemo %>%
mutate(LowIncome = case_when(
INDFMIN2 < 40 ~ TRUE,
TRUE ~ FALSE
)) %>%
# Select the necessary columns
select(INDFMIN2, LowIncome, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')
How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.
If you want to plot a histogram from your model, you can get its data from model.frame (this is what svyhist does under the hood). To get the histogram filled by group, you could use this data frame inside ggplot:
library(ggplot2)
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
theme_classic()
Edit
As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(aes(weight = persWeight), alpha = 0.5,
color = "gray60", breaks = 0:20 / 10) +
theme_classic()
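As a quick check of what the weight aesthetic does, here is a sketch on a hypothetical three-row data frame (toy and its columns are made up for illustration). With weight mapped, each bin height should be the sum of the weights falling in that bin rather than the number of rows.

```r
library(ggplot2)

# Hypothetical toy data: two rows in the first bin, one in the second
toy <- data.frame(x = c(0.5, 0.5, 1.5), w = c(2, 3, 10))

p <- ggplot(toy, aes(x)) +
  geom_histogram(aes(weight = w), breaks = 0:2)

# Computed bin heights: weight sums per bin, not row counts
layer_data(p, 1)$count
```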
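To demonstrate that this approach works, we can replicate Thomas's approach in ggplot using the data example from ?svyhist. To get the uneven bin sizes (if desired), we need two histogram layers, though I'm guessing this would not be required for most use cases.
# dstrat as defined in the ?svyhist example
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)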
ggplot(model.frame(dstrat), aes(enroll)) +
geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype == "E"),
breaks = 0:35 * 100,
position = "identity", col = "gray50") +
geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype != "E"),
position = "identity", col = "gray50",
breaks = 0:7 * 500) +
scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
theme_classic()
You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible.
Using the data example from ?svyhist:
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)
Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialFinishDate denotes when each individual finished the assignment.
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000, replace = TRUE),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?
You're specifying xintercept outside of aes, so the faceting is not respected.
This should do the trick:
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))
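An alternative sketch, not part of the answer above: pass geom_vline its own one-row-per-group data frame so that each facet draws exactly one line, and map colour explicitly in that layer so the line matches the bars. The names days, counts and starts are introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)

# Same simulation as in the question
set.seed(123)
days <- seq(as.Date("2019-02-01"), as.Date("2019-02-15"), by = "day")
data <- data.frame(
  TrialStartDate  = rep(sample(days, 5), each = 200),
  TrialFinishDate = sample(days, 1000, replace = TRUE),
  ID = 1:1000
)

counts <- count(data, TrialStartDate, TrialFinishDate)
starts <- distinct(data, TrialStartDate)   # one row per start date

p <- ggplot(counts, aes(x = TrialFinishDate, y = n,
                        fill = factor(TrialStartDate))) +
  geom_col() +
  # starts contains the faceting variable, so each line stays in its own panel
  geom_vline(data = starts,
             aes(xintercept = TrialStartDate,
                 colour = factor(TrialStartDate))) +
  facet_wrap(~TrialStartDate, ncol = 1)
p
```

Since starts has a single row per start date, only one line is drawn per panel instead of one per row of the summarised data.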
I would like to force-render a smoothed line for this multi-group plot, even in situations where a group has only one or two values. See below:
library(ggplot2)
set.seed(1234)
df <- data.frame(group = factor(c(rep("A",3),rep("B",2),"C")), x = c(1,2,3,1,2,2), value = runif(6))
ggplot(df,aes(x=x,y=value,group=group,color=group))+
geom_point(size=2)+
geom_line(stat="smooth",method = "loess",size = 2, alpha = 0.3)
Here's the output I want to see:
The call gives a lot of warnings which can be inspected by warnings(). One of the warnings says "zero-width neighborhood. make span bigger".
So, I tried OP's code with the additional span = 1 parameter:
library(ggplot2)
ggplot(df, aes(x = x, y = value, group = group, color = group)) +
geom_point(size = 2) +
geom_line(
stat = "smooth",
method = "loess",
span = 1,
size = 2,
alpha = 0.3
)
and got smoothed curves for groups A and B with only 3 and 2 data points, resp.
I am trying to plot labels above bars with the stat_summary function and a custom function that I wrote. There are three bars, and they should be labeled a, b and c, respectively. However, instead of one label per bar, all three labels are placed on top of each other:
codes <- c ("a", "b", "c")
simple_y <- function(x) {
return (data.frame (y = mean (x) + 1, label = codes))
}
ggplot (iris, mapping = aes (x = Species, y = Sepal.Length)) +
geom_bar (stat = "summary", fun.y = "mean", fill = "blue", width = 0.7, colour = "black", size = 0.7) +
stat_summary (fun.data = simple_y, geom = "text", size = 10)
I do understand why this is not working: each time the simple_y function is called, it sees the whole codes vector. However, I have no clue how to tell R to separate the three labels. Is it possible to tell R to use the n-th element of an input vector on the n-th call of a function?
Does anybody have a good hint?
I would consider doing something like this:
library(dplyr)
library(ggplot2)
labels <-
tibble(
Species = factor(c("setosa", "versicolor", "virginica")),
codes = c("a", "b", "c")
)
iris %>%
group_by(Species) %>%
summarize(Mean = mean(Sepal.Length)) %>%
ungroup() %>%
left_join(labels, by = "Species") %>%
ggplot(aes(x = Species, y = Mean)) +
geom_col(fill = "blue", width = 0.7, color = "black", size = 0.7) +
geom_text(aes(y = Mean + 0.3, label = codes), size = 6, show.legend = FALSE)
First, you can generate the data frame with means separately, avoiding the need for geom_bar and stat_summary. Then after joining the manual labels/codes to that summarized data frame, it's pretty straightforward to add them with geom_text.
This is my first Stack Overflow post and I am a relatively new R user, so please go gently!
I have a data frame with three columns, a participant identifier, a condition (factor with 2 levels either Placebo or Experimental), and an outcome score.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
I would like to construct a bar plot with two bars showing the mean outcome score for each condition, with the standard deviation as an error bar. I would then like to overlay lines connecting the points for each participant's score in each condition, so that the plot displays the individual responses as well as the group means. If possible, I would also like to include an axis break.
I don't seem to be able to find any advice in other threads, apologies if I am repeating a question.
Many Thanks.
P.S. I realise that presenting data in this way will not be to everyone's taste. It is for a specific requirement!
This ought to work:
library(ggplot2)
library(dplyr)
dat.summ <- dat %>% group_by(Condition) %>%
summarize(mean.outcome = mean(Outcome),
sd.outcome = sd(Outcome))
ggplot(dat.summ, aes(x = Condition, y = mean.outcome)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.outcome - sd.outcome,
ymax = mean.outcome + sd.outcome),
color = "dodgerblue", width = 0.3) +
geom_point(data = dat, aes(x = Condition, y = Outcome),
color = "firebrick", size = 1.2) +
geom_line(data = dat, aes(x = Condition, y = Outcome, group = ID),
color = "firebrick", size = 1.2, alpha = 0.5) +
scale_y_continuous(limits = c(0, max(dat$Outcome)))
Some people are better with ggplot's stat functions and arguments than I am and might do it differently. I prefer to just transform my data first.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
dat.w <- reshape(dat, direction = 'wide', idvar = 'ID', timevar = 'Condition')
means <- colMeans(dat.w[, 2:3])
sds <- apply(dat.w[, 2:3], 2, sd)
ci.l <- means - sds
ci.u <- means + sds
ci.width <- .25
bp <- barplot(means, ylim = c(0,20))
segments(bp, ci.l, bp, ci.u)
segments(bp - ci.width, ci.u, bp + ci.width, ci.u)
segments(bp - ci.width, ci.l, bp + ci.width, ci.l)
segments(x0 = bp[1], x1 = bp[2], y0 = dat.w[, 2], y1 = dat.w[, 3], col = 1:10)
points(c(rep(bp[1], 10), rep(bp[2], 10)), dat$Outcome, col = 1:10, pch = 19)
Here is a method using the transformations inside ggplot2:
ggplot(dat) +
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun="mean", geom="bar") +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun.data="mean_se", geom="errorbar", col="green", width=.8, size=2) +
geom_line(aes(x=Condition, y=Outcome, group=ID), col="red")