ggplot2 and jitter/dodge points by a group - r

I have 'elevation' as my y-axis and I want it as a discrete variable (in other words I want the space between each elevation to be equal and not relative to the numerical differences). My x-axis is 'time' (julian date).
mydata2<- data.frame(
"Elevation" = c(rep(c(1200),10),rep(c(1325.5),10),rep(c(1350.75),10), rep(c(1550.66),10)),
"Sex" = c(rep(c("F","M"),20)),
"Type" = c(rep(c("emerge","emerge","endhet","endhet","immerge","immerge","melt","melt", "storpor","storpor"),4)),
"mean" = c(rep(c(104,100,102,80,185,210,84,84,188,208,104,87,101,82, 183,188,83,83,190,189),2))
"se"=c(rep(c(.1,.01,.2,.02,.03),4)))
mydata2$Sex<-factor(mydata2$Sex))
mydata2$Type<-factor(mydata2$Type))
mydata2$Elevation<-factor(mydata2$Elevation))
at<-ggplot(mydata2, aes(y = mean, x = Elevation,color=Type, group=Sex)) +
geom_pointrange(aes(ymin = mean-se, ymax = mean+se),
position=position_jitter(width=0.2,height=.1),
linetype='solid') +
facet_grid(Sex~season,scales = "free")+
coord_flip()
at
Ideally, I would like each 'type' to be separated vertically. When I jitter or dodge only those that are close separate and not evenly. Is there a way to force each 'type' to be slightly shifted so they are all on their own line? I tried to force it by giving each type a slightly different 'elevation' but then I end up with a messy y-axis (I can't figure out a way to keep the point but not display all the tick marks with a discrete scale).
Thank you for your help.

If you want to use a numerical value as a discrete value, you should use as.factor. In your example, try to use x = as.factor(Elevation).
Additionally, I will suggest to use position = position_dodge() to get points from different conditions corresponding to the same elevation to be plot side-by-side
ggplot(mydata2, aes(y = mean, x = as.factor(Elevation),color=Type, group=Sex)) +
geom_pointrange(aes(ymin = mean-se, ymax = mean+se),
position=position_dodge(),
linetype='solid') +
facet_grid(Sex~season,scales = "free")+
coord_flip()
EDIT with example data provided by the OP
Using your dataset, I was not able to get range being plot with your point. So, I create two variable Lower and Upper using dplyr package.
Then, I did not pass your commdnas facotr(...) you provided in your question but instead, I used as.factor(Elevation) and position_dodge(0.9) for the plotting to get the following plot:
library(tidyverse)
mydata2 %>% mutate(Lower = mean-se*100, Upper = mean+se*100) %>%
ggplot(., aes( x = as.factor(Elevation), y = mean, color = Type))+
geom_pointrange(aes(ymin = Lower, ymax = Upper), linetype = "solid", position = position_dodge(0.9))+
facet_grid(Sex~., scales = "free")+
coord_flip()
Does it look what you are looking for ?
Data
Your dataset provided contains few errors (too much parenthesis), so I correct here.
mydata2<- data.frame(
"Elevation" = c(rep(c(1200),10),rep(c(1325.5),10),rep(c(1350.75),10), rep(c(1550.66),10)),
"Sex" = rep(c("F","M"),20),
"Type" = rep(c("emerge","emerge","endhet","endhet","immerge","immerge","melt","melt", "storpor","storpor"),4),
"mean" = rep(c(104,100,102,80,185,210,84,84,188,208,104,87,101,82, 183,188,83,83,190,189),2),
"se"=rep(c(.1,.1,.2,.05,.03),4))

Related

How do I line up my error bars with my bars in ggplot?

I'm creating a bar chart with a pattern for a subset of the bars, and I want to add error bars.
However, I'm having trouble lining up the error bars with with the bar charts—I want to have them appear centered on each bar. How do I do this? Moreover, the legend currently does not clearly distinguish the striped and non-striped bars as corresponding to not treated and treated groups.
Finally, I'd like to create version of this plot which stacks adjacent bars (i.e. bars within each facet_grid)—any tips on how to do that would be much appreciated.
The code I'm using is:
library(ggplot2)
library(tidyverse)
library(ggpattern)
models = c("a", "b")
task = c("1","2")
ratios = c(0.3, 0.4)
standard_errors = c(0.02, 0.02)
ymax = ratios + standard_errors
ymin = ratios - standard_errors
colors = c("#F39B7FFF", "#8491B4FF")
df <- data.frame(task = task, ratios = ratios)
df <- df %>% mutate(filler = 1-ratios)
df <- df %>% gather(key = "obs", value = "ratios", -1)
df$upper <- df$ratios + c(standard_errors,standard_errors)
df$models <- c(models,models)
df$lower <- df$ratios - c(standard_errors,standard_errors)
df$col <- c(colors,colors)
df$group <- paste(df$task, df$models, sep="-")
df$treated <- "yes"
df[df$ratios<0.5,]$treated = "no"
p <- ggplot(df, aes(x = group, y = ratios, fill = col, ymin = lower, ymax = upper)) +
stat_summary(aes(pattern=treated),
fun = "mean", position=position_dodge(),
geom = "bar_pattern", pattern_fill="black", colour="black") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, position=position_dodge(0.9)) +
scale_pattern_manual(values=c("none", "stripe"))+ #edited part
facet_grid(.~task,
scales = "free_x", # Let the x axis vary across facets.
space = "free_x", # Let the width of facets vary and force all bars to have the same width.
switch = "x") + guides(colour = guide_legend(nrow = 1)) +
guides(fill = "none")
p
Here is an option
df %>%
ggplot(aes(x = models, y = ratios)) +
geom_col_pattern(
aes(fill = col, pattern = treated),
pattern_fill = "black",
colour = "black",
pattern_key_scale_factor = 0.2,
position = position_dodge()) +
geom_errorbar(
aes(ymin = lower, ymax = upper, group = interaction(task, treated)),
width = 0.2,
position = position_dodge(0.9)) +
facet_grid(~ task, scales = "free_x") +
scale_pattern_manual(values = c("none", "stripe")) +
scale_fill_identity()
A few comments:
I don't understand the point of creating group. IMO this is unnecessary. TBH, I also don't understand the point of models and task: if task = "1" then models = "a"; if task = "2" then models = "b"; so task and models are redundant as they encode the same thing (whether you call it "1"/"2" or "a"/"b").
The reason why you (originally) didn't see a pattern in the legend is because of the scale factor in the legend key. As per ?scale_col_pattern, you can adjust this with the pattern_key_scale_factor parameter. Here, I've chosen pattern_key_scale_factor = 0.2 but you may want to play with different values.
The reason why the error bars didn't align with the dodged bars was because geom_errorbar didn't know that there are different task-treated combinations. We can fix this by explicitly defining a group aesthetic given by the combination of task & treated values. The reason why you don't need this in geom_col_pattern is because you already allow for different treated values through the pattern aesthetic.
You want to use scale_fill_identity() if you already have actual colour values defined in the data.frame.

How to control the order of ggplot2::geom_pointrange elements by colour, shape and linetype

data=data.frame("X"=c(22,5,8,17,7,22),
"XMIN"=c(17.6,4,6.4,13.6,5.6,17.6),
"XMAX"=c(26.4,6,9.6,20.4,8.4,26.4),
"VAR"=c('A','B','C','A','B','C'),
"L1"=c(1,2,3,1,2,3),
"L2"=c(1,1,1,2,2,2))
ggplot(data) +
geom_pointrange(aes(
ymin = XMIN,
ymax = XMAX,
y = X,
x = reorder(VAR, -X),
colour = factor(L1),
shape = factor(L1),
linetype = factor(L2)))
I wish to add space between the lines for each variable A,B,C. Also within (A,B,C) for each variable I wish to sort the line from lowest to highest by X value.
See photo here,enter image description here
This seems to do the trick:
Updated in response to comments so that variables L1 and L2 control the colour, shape and linetype aesthetics.
The really tricky problem was overcoming the conflict between the order imposed by using factor(L2) and that wished for as a combination of VAR and X.
The axis order will trump the linetype order where the x values are distinct.
So created a continuous variable x_loc to locate observations on the x axis which are then re-labelled with the required values from VAR.
library(ggplot2)
library(dplyr)
data=data.frame("X"=c(22,5,8,17,7,22),
"XMIN"=c(17.6,4,6.4,13.6,5.6,17.6),
"XMAX"=c(26.4,6,9.6,20.4,8.4,26.4),
"VAR"=c('A','B','C','A','B','C'),
"L1"=c(1,2,3,1,2,3),
"L2"=c(1,1,1,2,2,2))
# reorder the data to be in the right plotting order: grouping by VAR with X in ascending order, then everything follows quite nicely.
data1 <-
data %>%
arrange(VAR, X) %>%
mutate(x_loc = c(0.8, 1.1) + rep(0:2, each = 2))
data1
ggplot(data1) +
labs(x = "VAR") +
scale_x_continuous(breaks = 1:3, labels = data1$VAR[1:3])+
theme_minimal()+
theme(legend.position = "top")+
geom_pointrange(aes(
ymin = XMIN,
ymax = XMAX,
y = X,
x = x_loc,
linetype = factor(L2),
colour = factor(L1),
shape = factor(L1)))
Which results in:
Note: for some reason I do not fully understand adding additional ggplot layers after the geom_pointrange function resulted in revealing the list elements of the 'ggplot' layer. Something to follow up another time.

Smoothen Heatmap in ggplot

I have a dataframe that looks as follows:
X = c(6,6.2,6.4,6.6,6.8,5.6,5.8,6,6.2,6.4,6.6,6.8,7,7.2,7.4,7.6,7.8,8,2.8,3,3.2,3.4,3.6,3.8,4,4.2,4.4,4.6,4.8,5)
Y = c(2.2,2.2,2.2,2.2,2.2,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8)
Value = c(0,0.00683254,0,0.007595654,0.015517884,0,0,0,0,0,0,0,0,0,0.005219395,0,0,0,0,0,0,0,0,0,0,0,0.002892342,0,0.002758141,0)
table = data.frame(X, Y, Value)
I have put together a heatmap in R, based on the following command:
ggplot(data = table, mapping = aes(x = X, y = Y)) +
geom_tile(aes(fill = Value), colour = 'black') +
theme_void() +
scale_fill_gradient2(low = "white", high = "black") + xlab(label = "X") + ylab(label = "Y")
Since there is not a value for every X and Y, it leads to plots that appear as follows.
I am attempting to smoothen the plot and have the following question:
As there are small white spaces between the plotted values, how could one color these white spaces to be the median intensity? Said differently, how would I first create an initial layer with non-zero median 'Value' before plotting the non-zero 'Value' on top (overlayed)?
A sample is shown below, which has been 'smoothed', which looks closer to the desired output.
I'm not sure if it will totally fit your need but from my understanding you have some missing values and combination of X and Y.
So, you can use complete function from tidyr to get all different combinations of X and Y (those without values will be filled with NA) and then by using na.value argument in scale_fill_gradient2 function, you can set the values of these NA values to the same color of the midpoint value:
library(tidyr)
library(dplyr)
library(ggplot2)
table %>% complete(X,Y) %>%
ggplot(aes(x = X, y = Y))+
geom_raster(aes(fill = Value), interpolate = TRUE)+
scale_fill_gradient2(low = "white", mid = "grey",high = "black",
na.value = "grey")
Does it answer your question ?

dodge columns in ggplot2

I am trying to create a picture that summarises my data. Data is about prevalence of drug use obtained from different practices form different countries. Each practice has contributed with a different amount of data and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one where bars are all on top of each other while I want them "dodge".
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = F) +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The width is too small in the C1 country and as you indicated the one clinic is quite influential.
Also, you can specify your aesthetics with the ggplot(aes(...)) and not have to reset it and it is not needed to include the dataframe objects name in the aes function within the ggplot call.

Lines connecting jittered points - dodging by multiple groups

I try to connect jittered points between measurements from two different methods (measure) on an x-axis. These measurements are linked to one another by the probands (a), that can be separated into two main groups, patients (pat) and controls (ctr),
My df is like that:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(measure) - .25,
as.numeric(measure) + .25 ) )
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
If you also want jittering, that can also be added manually to both you x and y variables.
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(measure) - .25, .1),
jitter(as.numeric(measure) + .25, .1) ),
jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by only showing one point per subject, but showing both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is also sometimes necessary, ggplot not always requires long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))

Resources