Adding geom_line between data points with different geom_boxplot fill variable - r

Hi I have a much larger data frame but a sample dummy df is as follows:
set.seed(23)
df = data.frame(name = c(rep("Bob",8),rep("Tom",8)),
topic = c(rep(c("Reading","Writing"),8)),
subject = c(rep(c("English","English","Spanish","Spanish"),4)),
exam = c(rep("First",4),rep("Second",4),rep("First",4),rep("Second",4)),
score = sample(1:100,16))
I need to plot it as shown in the picture below (for my original data frame), but with lines connecting each name's scores between the First and Second levels of the exam variable. I tried geom_line(aes(group=name)), but the lines are not connected in the right way. Is there a way to connect the points that also respects the grouping by the fill variable, similar to how position_dodge() separates the points by their fill grouping? Thanks a lot!
library(ggplot2)
library(magrittr) # provides %>%
df %>% ggplot(aes(x=topic,y=score,fill=exam)) +
geom_boxplot(outlier.shape = NA) +
geom_point(size=1.75,position = position_dodge(width = 0.75)) +
facet_grid(~subject,switch = "y")

One option to achieve your desired result would be to group the lines by name and topic and do the dodging of the lines manually instead of relying on position_dodge. To this end, convert topic to a numeric for the geom_line and shift the position by the amount needed to align the lines with the dodged points:
set.seed(23)
df <- data.frame(
name = c(rep("Bob", 8), rep("Tom", 8)),
topic = c(rep(c("Reading", "Writing"), 8)),
subject = c(rep(c("English", "English", "Spanish", "Spanish"), 4)),
exam = c(rep("First", 4), rep("Second", 4), rep("First", 4), rep("Second", 4)),
score = sample(1:100, 16)
)
library(ggplot2)
ggplot(df, aes(x = topic, y = score, fill = exam)) +
geom_boxplot(outlier.shape = NA) +
geom_point(size = 1.75, position = position_dodge(width = 0.75)) +
geom_line(aes(
x = as.numeric(factor(topic)) + .75 / 4 * ifelse(exam == "First", -1, 1),
group = interaction(name, topic)
)) +
facet_grid(~subject, switch = "y")
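The offset .75 / 4 is specific to two fill groups. As a rough sketch of where it comes from, assuming position_dodge() divides the dodge width evenly among the groups at each x position, the centre of group i out of n sits at width * ((i - 0.5) / n - 0.5). The helper dodge_offset() below is a hypothetical name used for illustration, not a ggplot2 function:

```r
# Hypothetical helper: horizontal offset of the i-th of n dodged groups,
# assuming the dodge width is divided evenly among the groups.
dodge_offset <- function(i, n, width = 0.75) {
  width * ((i - 0.5) / n - 0.5)
}

# Two exam levels ("First", "Second") with dodge width 0.75
dodge_offset(1:2, n = 2)  # -0.1875 and 0.1875, i.e. -/+ .75 / 4
```

With three fill levels the same formula would give offsets of -0.25, 0 and 0.25, so the ifelse() in the answer would need to be replaced by a lookup on the level index.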

Related

display mean value (rearrange data frame?)

I want to draw a boxplot of two groups (A and B) and display the mean value on each box.
I have 30 rows and 2 columns: each row contains a value for group A (column 1) and for group B (column 2).
I made a boxplot with the base graphics boxplot function:
boxplot(Data_Q4$Group.A,Data_Q4$Group.B,names=c("group A","group B"))
but it seems that adding a mean point to the boxplot requires ggplot2.
I tried many things, but I always get an error message:
! Aesthetics must be either length 1 or the same as the data (30): x...
It seems my problem comes from the y axis. I need it to take the data from columns A and B, but I don't know how to do this.
If my data had a value column and a group column (A or B for each row) it would work, but I don't know how to rearrange it so that I get 2 columns (value and group) and 60 rows with the values of both groups.
Then I could do dataQ4 %>% ggplot(aes(x=group,y=value))+geom_boxplot()+stat_summary(fun.y=mean)
and I think it would be OK.
So my problem is to rearrange my data frame so that I can use ggplot and make the boxplot.
Thanks for your help!
Here is my data:
dput(Data_Q4) structure(list(Group.A = c(1.25310535, 0.5546414, 0.301283, 1.29312466, 0.99455579, 0.5141743, 2.0078324, 0.42224244, 2.17877257, 3.21778902, 0.55782935, 0.59461765, 0.97739581, 0.20986658, 0.30944786, 1.10593627, 0.77418776, 0.08967408, 1.10817666, 0.24726425, 1.57198685, 4.83281274, 0.43113213, 2.73038931, 1.13683142, 0.81336825, 0.83700649, 1.7847654, 2.31247163, 2.90988727), Group.B = c(2.94928948, 0.70302878, 0.69016263, 1.25069011, 0.43649776, 0.22462232, 0.39231981, 1.5763435, 0.42792839, 0.19608026, 0.37724368, 0.07071508, 0.03962611, 0.38580831, 2.63928857, 0.78220807, 0.66454197, 0.9568569, 0.02484568, 0.21600677, 0.88031195, 0.13567357, 0.68181725, 0.20116062, 0.4834762, 0.50102846, 0.15668497, 0.71992076, 0.68549794, 0.86150777)), class = "data.frame", row.names = c(NA, -30L))
First I create some random data:
df <- data.frame(group = rep(c("A", "B"), 15),
value = runif(30, 0, 10))
You can use the following code:
library(tidyverse)
ggplot(data = df,
aes(x = group, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
In the resulting plot, the red dots represent the mean.
Using your data:
You can use the following code:
library(tidyverse)
library(reshape)
Data_Q4 %>%
melt() %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
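If you would rather not depend on reshape, base R's stack() performs the same wide-to-long rearrangement. A minimal sketch with a small stand-in for Data_Q4 (the values below are made up for illustration):

```r
# Stand-in for Data_Q4: one column per group (illustrative values only)
wide <- data.frame(Group.A = c(1.2, 0.5, 0.3),
                   Group.B = c(2.9, 0.7, 0.6))

# stack() returns two columns: `values` and `ind` (the source column name)
long <- stack(wide)
names(long) <- c("value", "group")

nrow(long)          # 6 rows: 3 values x 2 groups
levels(long$group)  # "Group.A" "Group.B"
```

The long data frame can then be passed to ggplot(long, aes(x = group, y = value)) in the same way as the melted data above.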

Time series density plot across factors for large data set

I want to make time-based density plots of a variable, by the factor water use management, for the following data set:
set.seed(123)
ID = rep(c("BAU","IMP","SGR","CR"), each=25)
Time = rep (c(1,2,3,4,5), each = 20)
data <- data.frame( ID, Time, profits = runif(100,0,1))
I am using the following code to make a density plot of profits across IDs. Or should I perhaps use facet_wrap for group or time?
library(ggridges)
ggplot(
data, aes(x = profits, y=as.factor( Time), group = ID, fill=stat(x))) +
geom_density_ridges_gradient(scale = 3, size = 0.3, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Profits", option = "C") +
labs(title = 'Total Profits') + facet_wrap(~ID, scales = "free")+
theme_classic()
It gives nice density plots. But my original data is large, with millions of rows; for instance:
set.seed(123)
ID = rep(c("BAU","IMP","SGR","CR"), each=5000)
Time = rep (c(1:1000), each = 20)
data <- data.frame( ID, Time, profits = runif(20000,0,1))
the code gives an untidy graph. Can we turn Time into a factor with intervals of, say, 20 each to make the graph understandable? Or could any other faceting or wrapping improve the readability of the graph? Please help.
Thanks
Take a look at this:
https://stackoverflow.com/a/55852972/13142581
set.seed(123)
ID = rep(c("BAU","IMP","SGR","CR"), each=25)
Time = rep (c(1,2,3,4,5), each = 20)
data <- data.frame( ID, Time, profits = runif(100,0,1))
library(ggridges)
ggplot(
data, aes(x = profits, y=as.character(ID), group = as.character(Time), fill=stat(x))) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Profits", option = "C") +
labs(title = 'Total Profits') + facet_wrap(~ID, scales = "free")+
theme_classic()
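The code above does not cover the second part of the question, grouping Time into intervals of 20. One possible approach, sketched here with base R's cut() on the larger example data (the column name TimeBin is an assumption made for illustration):

```r
set.seed(123)
ID   <- rep(c("BAU", "IMP", "SGR", "CR"), each = 5000)
Time <- rep(1:1000, each = 20)
data <- data.frame(ID, Time, profits = runif(20000, 0, 1))

# Bin Time into 50 intervals of width 20: (0,20], (20,40], ..., (980,1000]
data$TimeBin <- cut(data$Time, breaks = seq(0, 1000, by = 20))

nlevels(data$TimeBin)     # 50
table(data$TimeBin)[[1]]  # 400 rows fall in (0,20]
```

TimeBin could then be mapped to the y aesthetic in place of as.factor(Time). Fifty ridges per facet may still be crowded, so a wider interval (a larger `by`) might be needed for a readable plot.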

Side by Side R Barplot with error bars

Say I have the means of two datasets that I want to plot as barplots with error bars next to each other, in ggplot2 or base R.
Each dataset consists of a matrix of numbers
10 20 12
10 20 12
10 20 12
which is then transformed into a mean vector of for example 3 elements
10 20 12
What I want to do is to take both mean vectors and plot them as a bar plot where the first element of one is besides the first element of the other
Dataset1Element1Bar-Dataset2Element1Bar Dataset1Element2Bar-Dataset2Element2Bar etc
Give each bar an error bar, say of standard deviation. I know I can calculate it through sd but I'm not sure how to stick it into the graph in the proper form
And lastly color them by their element number (ie Element 1)
I have the code to do one dataset but I'm not sure where to go from there.
result<-barplot(bardata, main="Mean Coverage", names.arg=namePosTargetGroup, ylab="mean Magnitude", cex.names=.4,col=c("red","blue","green"))
legend(10,legend=c("Group1","Group2","Group3"),fill = c("red","blue","green"))
A lot of what I look up gives the answer for one thing or another, but it's difficult to figure out how to combine them.
I would generally not recommend plotting just a bar chart with error bars. There are many other ways to plot your data, which reveal the data and its structure a lot better.
Especially if you just have very few cases, plotting means with bars is not good. A good explanation can be found here: Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm
I find it difficult to give you a good solution, since I don't know your research question. Knowing what you actually want to show or emphasize would make things easier.
I will give you two suggestions, one for a small dataset, one for a bigger one. All of them are created with ggplot2. I'm not coloring them by their "element number" but by their origin ("dataset 1/2"), since I find it easier to accomplish a proper graphic this way.
Small Dataset
Use geom_jitter to display all your cases, avoiding overplotting.
# import hadleyverse
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
# generate small amount of data
set.seed(1234)
df1 <- data.frame(v1 = rnorm(5, 4, 1),
v2 = rnorm(5, 5, 1),
v3 = rnorm(5, 6, 1),
origin = rep(factor("df1", levels = c("df1", "df2")), 5))
df2 <- data.frame(v1 = rnorm(5, 4.5, 1),
v2 = rnorm(5, 5.5, 1),
v3 = rnorm(5, 6.5, 1),
origin = rep(factor("df2", levels = c("df1", "df2")), 5))
# merge dataframes and gather in long format
pdata <- bind_rows(df1, df2) %>%
gather(id, variable, -origin)
# plot data
ggplot(pdata, aes(x = id, y = variable, fill = origin, colour = origin)) +
stat_summary(fun = mean, geom = "point", position = position_dodge(width = .5),
size = 30, shape = "-", show.legend = FALSE, alpha = .7) + # plot mean as "-"
geom_jitter(position = position_jitterdodge(jitter.width = .3, jitter.height = .1,
dodge.width = .5),
size = 4, alpha = .85) +
labs(x = "Variable", y = NULL) + # set axis labels
theme_light() # nicer theme
"Big" Dataset
If you have more datapoints, you can use geom_violin to summarise them.
set.seed(12345)
df1 <- data.frame(v1 = rnorm(50, 4, 1),
v2 = rnorm(50, 5, 1),
v3 = rnorm(50, 6, 1),
origin = rep(factor("df1", levels = c("df1", "df2")), 50))
df2 <- data.frame(v1 = rnorm(50, 4.5, 1),
v2 = rnorm(50, 5.5, 1),
v3 = rnorm(50, 6.5, 1),
origin = rep(factor("df2", levels = c("df1", "df2")), 50))
# merge dataframes
pdata <- bind_rows(df1, df2) %>%
gather(id, variable, -origin)
# plot with violin plot
ggplot(pdata, aes(x = id, y = variable, fill = origin)) +
geom_violin(adjust = .6) +
stat_summary(fun = mean, geom = "point", position = position_dodge(width = .9),
size = 6, shape = 4, show.legend = FALSE) +
guides(fill = guide_legend(override.aes = list(colour = NULL))) +
labs(x = "Variable", y = NULL) +
theme_light()
Version with mean and sd
If you insist on plotting the mean with standard deviation, here is how it could be done.
# merge dataframes and compute limits for sd
pdata <- bind_rows(df1, df2) %>%
gather(id, variable, -origin) %>%
group_by(origin, id) %>% # group data for limit calculation
mutate(upper = mean(variable) + sd(variable), # upper limit for error bar
lower = mean(variable) - sd(variable)) # lower limit for error bar
# plot
ggplot(pdata, aes(x = id, y = variable, fill = origin)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge(width = .9),
size = 3) +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, # Width of the error bars
position = position_dodge(.9))

Bar plot of group means with lines of individual results overlaid

This is my first Stack Overflow post and I am a relatively new R user, so please go gently!
I have a data frame with three columns, a participant identifier, a condition (factor with 2 levels either Placebo or Experimental), and an outcome score.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
I would like to construct a bar plot with two bars showing the mean outcome score for each condition, with the standard deviation as an error bar. I would then like to overlay lines connecting the points for each participant's score in each condition, so that the plot displays the individual responses as well as the group mean. If possible, I would also like to include an axis break.
I don't seem to be able to find any advice in other threads, apologies if I am repeating a question.
Many Thanks.
P.S. I realise that presenting data in this way will not be to everyone's taste. It is for a specific requirement!
This ought to work:
library(ggplot2)
library(dplyr)
dat.summ <- dat %>% group_by(Condition) %>%
summarize(mean.outcome = mean(Outcome),
sd.outcome = sd(Outcome))
ggplot(dat.summ, aes(x = Condition, y = mean.outcome)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.outcome - sd.outcome,
ymax = mean.outcome + sd.outcome),
color = "dodgerblue", width = 0.3) +
geom_point(data = dat, aes(x = Condition, y = Outcome),
color = "firebrick", size = 1.2) +
geom_line(data = dat, aes(x = Condition, y = Outcome, group = ID),
color = "firebrick", size = 1.2, alpha = 0.5) +
scale_y_continuous(limits = c(0, max(dat$Outcome)))
Some people are better with ggplot's stat functions and arguments than I am and might do it differently. I prefer to just transform my data first.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
dat.w <- reshape(dat, direction = 'wide', idvar = 'ID', timevar = 'Condition')
means <- colMeans(dat.w[, 2:3])
sds <- apply(dat.w[, 2:3], 2, sd)
ci.l <- means - sds
ci.u <- means + sds
ci.width <- .25
bp <- barplot(means, ylim = c(0,20))
segments(bp, ci.l, bp, ci.u)
segments(bp - ci.width, ci.u, bp + ci.width, ci.u)
segments(bp - ci.width, ci.l, bp + ci.width, ci.l)
segments(x0 = bp[1], x1 = bp[2], y0 = dat.w[, 2], y1 = dat.w[, 3], col = 1:10)
points(c(rep(bp[1], 10), rep(bp[2], 10)), dat$Outcome, col = 1:10, pch = 19)
Here is a method using the transformations inside ggplot2. Note that mean_se draws the standard error of the mean rather than the standard deviation:
ggplot(dat) +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun="mean", geom="bar") +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun.data="mean_se", geom="errorbar", col="green", width=.8, size=2) +
geom_line(aes(x=Condition, y=Outcome, group=ID), col="red")
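Since the question asks for the standard deviation as the error bar, one way to get it from stat_summary() is to supply your own summary function as fun.data. A minimal sketch; mean_sd is a hypothetical helper name, not part of ggplot2:

```r
# Hypothetical helper for stat_summary(fun.data = ...): returns the mean
# together with mean - sd and mean + sd as the error bar limits.
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

mean_sd(c(13, 15, 17))  # y = 15, ymin = 13, ymax = 17
```

Passing fun.data = mean_sd in place of fun.data = "mean_se" in the errorbar layer should then draw bars of one standard deviation on either side of the mean.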

How to prevent two labels to overlap in a barchart?

The image below shows a chart that I created with the code below. I highlighted the missing or overlapping labels. Is there a way to tell ggplot2 to not overlap labels?
week = c(0, 1, 1, 1, 1, 2, 2, 3, 4, 5)
statuses = c('Shipped', 'Shipped', 'Shipped', 'Shipped', 'Not-Shipped', 'Shipped', 'Shipped', 'Shipped', 'Not-Shipped', 'Shipped')
dat <- data.frame(Week = week, Status = statuses)
p <- qplot(factor(Week), data = dat, geom = "bar", fill = factor(Status))
p <- p + geom_bar()
# Below is the most important line, that's the one which displays the value
p <- p + stat_count(aes(label = ..count..), geom = "text", vjust = -1, size = 3)
p
You can use a variant of the well-known population pyramid.
Some sample data (code inspired by Didzis Elferts' answer):
set.seed(654)
week <- sample(0:9, 3000, rep=TRUE, prob = rchisq(10, df = 3))
status <- factor(rbinom(3000, 1, 0.15), labels = c("Shipped", "Not-Shipped"))
data.df <- data.frame(Week = week, Status = status)
Compute count scores for each week, then convert one category to negative values:
library("plyr")
plot.df <- ddply(data.df, .(Week, Status), nrow)
plot.df$V1 <- ifelse(plot.df$Status == "Shipped",
plot.df$V1, -plot.df$V1)
Draw the plot. Note that the y-axis labels are adapted to show positive values on either side of the baseline.
library("ggplot2")
ggplot(plot.df) +
aes(x = as.factor(Week), y = V1, fill = Status) +
geom_bar(stat = "identity", position = "identity") +
scale_y_continuous(breaks = 100 * -1:5,
labels = 100 * c(1, 0:5)) +
geom_text(aes(y = sign(V1) * max(V1) / 30, label = abs(V1)))
For production purposes you'd need to determine the appropriate y-axis tick labels dynamically.
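One possible way to do that, sketched with base R's pretty() on made-up signed counts standing in for plot.df$V1 (the names brks and labs are assumptions for illustration):

```r
# Signed counts as in plot.df$V1 (illustrative values only)
v1 <- c(-120, 340, -80, 510)

# pretty() picks round break values that cover the range of the data
brks <- pretty(range(v1))

# Show positive labels on both sides of the baseline
labs <- abs(brks)
```

These could then be passed to scale_y_continuous(breaks = brks, labels = labs) instead of the hard-coded multiples of 100.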
I made new sample data (inspired by the code of @agstudy).
week <- sample(0:5,1000,rep=TRUE,prob=c(0.2,0.05,0.15,0.5,0.03,0.1))
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
Using the function ddply() from the plyr package, I made a new data frame text.df for the labels. The column count contains the number of observations in each combination of Week and Status. I then added a column ypos that contains the cumulative sum of count for each Week plus 15. This is used as the y position. For Not-Shipped, ypos is replaced with -10.
library(plyr)
text.df<-ddply(dat,.(Week,Status),function(x) data.frame(count=nrow(x)))
text.df<-ddply(text.df,.(Week),transform,ypos=cumsum(count)+15)
text.df$ypos[text.df$Status=="Not-Shipped"]<- -10
Now labels are plotted with geom_text() using new data frame.
ggplot(dat,aes(as.factor(Week),fill=Status))+geom_bar()+
geom_text(data=text.df,aes(x=as.factor(Week),y=ypos,label=count))
One solution to avoid overlaps is to use a dodge position for both the bars and the text. To avoid missing values you can set ylim. Here is an example.
## I create some more realistic data similar to your picture
week <- sample(0:5,1000,rep=TRUE)
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
## for dodging
dodgewidth <- position_dodge(width=0.9)
## get max y to set ylim
ymax <- max(table(dat$Week,dat$Status))+20
ggplot(dat,aes(x = factor(Week),fill = factor(Status))) +
geom_bar( position = dodgewidth ) +
stat_count(geom="text", position= dodgewidth, aes( label=..count..),
vjust=-1,size=5)+
ylim(0,ymax)
Based on Didzis' plot, you could also increase readability by keeping the position on the y axis constant and by colouring the text in the same colour as the legend.
library(ggplot2)
week <- sample(0:5,1000,rep=TRUE,prob=c(0.2,0.05,0.15,0.5,0.03,0.1))
statuses <- gl(2,1000,labels=c('Not-Shipped', 'Shipped'))
dat <- data.frame(Week = week, Status = statuses)
library(plyr)
text.df<-ddply(dat,.(Week,Status),function(x) data.frame(count=nrow(x)))
text.df$ypos[text.df$Status=="Not-Shipped"]<- -15
text.df$ypos[text.df$Status=="Shipped"]<- -55
p <- ggplot(dat,aes(as.factor(Week),fill=Status))+geom_bar()+
geom_text(data=text.df,aes(x=as.factor(Week),y=ypos,label=count),colour=ifelse(text.df$Status=="Not-Shipped","#F8766D","#00BFC4"))
