I am trying to create a boxplot using ggplot2 and need two y-axes with different scales for data from the same data frame. I am plotting surface area to volume (SA:V) ratios of three appendages for two species, and one of the appendages has a much higher SA:V ratio than the other two, which makes it difficult to show them all on the same graph.
I've recreated my data and code for the boxplot to demonstrate what I am talking about. If possible I would like the dorsal fins to be displayed on the same graph, but on a different y axis scale (that will also be shown on the graph) just so the boxes of the boxplot are all visible.
SAV <- c(seq(.35, .7, .01), seq(.09, .125, .001), seq(.09, .125, .001))
Type <- c(rep("Pectoral Fin", 36), rep("Dorsal fin", 36), rep("Fluke", 36))
Species <- c(rep(c(rep("Sp1", 18), rep("Sp2", 18)), 3))
appendage <- data.frame(SAV, Type, Species)
ggplot(aes(y = appendage$SAV,
x = factor(appendage$Type, levels = c("Dorsal fin", "Fluke")),
fill = appendage$Species),
data = appendage) +
geom_boxplot(outlier.shape = NA) +
labs(y = expression("SA:V("*cm^-1*")"), x="") +
scale_x_discrete(labels = c("PF", "DF", "F")) +
scale_fill_manual(values = c("black", "gray"))
If anyone could help me with this that would be great!
One possibility is to use facet_wrap.
library(dplyr)
library(ggplot2)

appendage %>%
mutate(
Type = factor(Type,
levels = c("Dorsal fin", "Fluke", "Pectoral Fin"),
labels = c("DF", "F", "PF"))) %>%
ggplot(aes(Type, SAV, fill = Species)) +
geom_boxplot(outlier.shape=NA) +
labs(y=expression("SA:V("*cm^-1*")"),x="") +
scale_fill_manual(values=c("black","gray")) +
facet_wrap(~Type, scales="free") +
theme(axis.ticks.x = element_blank(),
strip.background = element_blank(),
strip.text.x = element_blank())
First off, like what others have commented, I do not recommend this type of plot. Dual axes have a tendency to make comparisons harder, & visually confuse the audience even when they are aware of it.
That said, it is possible to achieve this using ggplot2, & I'll show one approach below, once we get past several other issues in the original code:
Issue 1: You are passing a data frame to ggplot(). The dollar sign $ has no place in aes() in such cases.
Instead of:
ggplot(aes(y = appendage$SAV,
x = factor(appendage$Type), # ignore the levels for now; see next issue
fill = appendage$Species),
data = appendage) +
...
Use:
ggplot(aes(y = SAV,
x = factor(Type),
fill = Species),
data = appendage) +
...
Issue 2: Which appendage has the extraordinarily high SA:V?
From the code used to generate the sample dataset, it should be "Pectoral Fin", but the final result shows "DF". I assume the mapping between full terms & axis labels to be:
"Pectoral Fin" -> "PF"
"Dorsal fin" -> "DF"
"Fluke" -> "F"
... so this looks like a slip up between passing Type as a factor to the x parameter in aes(), and setting the axis labels in scale_x_discrete().
Since you're using factor(), it would be neater to set the labels there as well. Keeping it in the same place makes such things easier to spot.
Instead of:
ggplot(aes(y = SAV,
x = factor(Type, levels = c("Dorsal fin", "Fluke")),
fill = Species),
data = appendage) +
...
Use:
ggplot(aes(y = SAV,
x = factor(Type,
levels = c("Dorsal fin", "Fluke", "Pectoral Fin"),
labels = c("DF", "F", "PF")),
fill = Species),
data = appendage) +
...
I switched the order of factors as I feel it makes (marginally) more sense visually for the x-axis category corresponding to the secondary y-axis (typically on the right) to be on the right of other x-axis categories. You can change that if this isn't the desired case. Just make sure both levels = ... and labels = ... are changed together.
Solution for secondary y-axis
1. Manually rescale the values of the offending appendage (whichever fin that turns out to be) until its range is roughly similar to that of the other appendages. (In the example below, I used a simple division of y / 5, but more complicated functions can be used too.)
2. Set the sec.axis argument of scale_y_continuous() with sec_axis(), using the inverse of the rescaling function as the transformation. (In this case, y * 5.)
3. Label the original y-axis (left) and the secondary y-axis (right) accordingly to make it clear which appendage(s) each axis's scale applies to.
Final code + result:
k = 5 #rescale factor
ggplot(aes(y = ifelse(Type == "Pectoral Fin",
SAV / k, SAV),
x = factor(Type,
levels = c("Dorsal fin", "Fluke", "Pectoral Fin"),
labels = c("DF", "F", "PF")),
fill = Species),
data = appendage) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(sec.axis = sec_axis(~ . * k,  # inverse of the SAV / k rescaling
name = expression("SA:V ("*cm^-1*") PF"))) +
labs(y = expression("SA:V ("*cm^-1*") DF / F"), x = "") +
scale_fill_manual(values = c("black", "gray"))
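The factor k = 5 above was picked by eye. As a rough sketch (one possible choice, not the only sensible one), it can also be derived from the data by comparing group medians. The name `is_pf` and the rounding step are illustrative assumptions, not part of the original code.

```r
# Sample data from the question
SAV  <- c(seq(.35, .7, .01), seq(.09, .125, .001), seq(.09, .125, .001))
Type <- c(rep("Pectoral Fin", 36), rep("Dorsal fin", 36), rep("Fluke", 36))

# Flag the appendage that needs its own scale
is_pf <- Type == "Pectoral Fin"

# Ratio of medians, rounded so the secondary axis gets tidy breaks
k <- round(median(SAV[is_pf]) / median(SAV[!is_pf]))
k

# After dividing by k, the rescaled range should sit close to the others
range(SAV[is_pf] / k)
range(SAV[!is_pf])
```

With the sample data this gives the same factor of 5, so the plotting code above can be used unchanged.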
Using ggplot2, I'm attempting to reorder a data representation with 3 factors: condition, sex, and time.
library(ggplot2)
library(dplyr)
DF <- data.frame(value = rnorm(100, 20, sd = 0.1),
cond = c(rep("a",25),rep("b",25),rep("a",25),rep("b",25)),
sex = c(rep("M",50),rep("F",50)),
time = rep(c("1","2"),50)
)
ggplot(data=DF, aes( x = time,
y = value,
fill = cond,
colour = sex,
)
) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40")) +
ggtitle("aF,aM,bF,bM") +
theme(legend.position = "top")
Badly ordered plot.
The way ggplot2 automatically orders condition first and interleaves sex poses the issue. It defaults to an interleaved "aF,aM,bF,bM" order regardless of which factor I assign to which aesthetic.
For analysis purposes, my preferred order is "aM,bM,aF,bF". Order sex first and interleave condition. I tried to fix it by converting the 2x2 factor assignments to one group with 4 levels, which gives me complete control over the order:
DF %>% mutate(grp = as.factor(paste0(cond,sex))) -> DF
level_order <- c("aM", "bM", "aF", "bF")
ggplot(data=DF, aes( x = time,
y = value,
fill = factor(grp, levels = level_order),
colour = sex
)
) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080","#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40", "grey40", "grey10")) +
ggtitle("aM,bM,aF,bF") +
theme(legend.position = "top")
Ordering OK, bad representation.
However, artificial grouping like this has its downsides: subjects are not assigned to a group; they are male or female (which can't be changed) and are assigned to some condition. Also, the plot legend is unnecessarily cluttered: it has 6 keys instead of 4. It doesn't convey very well that this is a 2x2 repeated measures design.
I'm not sure if what I'm trying to do makes sense (I hope this isn't some massive brain fart), any help would be appreciated.
The order in which you place the aesthetics controls the priority of their groupings. Thus, if you switch the position of fill and colour you will get the result you are looking for (i.e. you want colour to be grouped first, and then fill).
ggplot(data=DF, aes( x = time,
y = value,
colour = sex,
fill = cond)) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40")) +
theme(legend.position = "top")
I'm trying to figure out two problems in R ggplot:
Show only data labels for every N day/data point
Highlight (make the line bigger and/or dotted) for a specific variable
My code is below:
ggplot(data = sales,
aes(x = dates, y = volume, colour = Country, size = ifelse(Country=="US", 1, 0.5), group = Country)) +
geom_line() +
geom_point() +
geom_text(data = sales, aes(label=volume), size=3, vjust = -0.5)
I can't find a way to space out the data labels. Currently they are shown for every data point on every day, which makes the plot very hard to read.
As for #2, unfortunately the size with ifelse doesn't work: the 'US' line becomes huge, and I can't change that size no matter what I specify in the first parameter of ifelse.
Would appreciate any help!
As no data was provided the solution is probably not perfect, but nonetheless shows you the general approach. Try this:
library(dplyr)
library(ggplot2)

sales_plot <- sales %>%
# Create label
# e.g. assuming dates are in Date-Format labels are "only" created for even days
mutate(label = ifelse(lubridate::day(dates) %% 2 == 0, volume, ""))
ggplot(data = sales_plot,
# To adjust the size: Simply set labels. The actual size is set in scale_size_manual
aes(x = dates, y = volume, colour = Country, size = ifelse(Country == "US", "US", "other"), group = Country)) +
geom_line() +
geom_point() +
geom_text(aes(label = label), size = 3, vjust = -0.5) +
# Set the size according the labels
scale_size_manual(values = c(US = 2, other = .9))
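The even-day rule above depends on calendar dates. If labels are wanted for every Nth data point instead, a running index per country is one option. This is only a sketch with made-up data, since the original sales data was not posted; `n`, `idx` and the toy values are assumptions.

```r
# Hypothetical stand-in for the `sales` data, which was not provided
sales <- data.frame(
  dates   = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 10), times = 2),
  Country = rep(c("US", "UK"), each = 10),
  volume  = c(1:10, 11:20)
)

n <- 3  # label every 3rd observation within each country

# Running index of each observation within its country
# (rows are assumed to be sorted by date within each country)
idx <- ave(seq_len(nrow(sales)), sales$Country, FUN = seq_along)

# Keep the label for observations 1, 4, 7, ... and blank out the rest
sales$label <- ifelse((idx - 1) %% n == 0, as.character(sales$volume), "")
sales
```

The resulting label column can then be passed to geom_text(aes(label = label)) exactly as in the code above.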
I have been trying to plot a graph of two sets of data with different point symbols and connecting lines with different colors using the R package ggplot2, but for the life of me, I have not been able to get the legend to correctly distinguish between the two curves by showing the associated data point symbol for each curve.
I can get the legend to show different line colors. But I have not been able to make the legend to show different data point symbols for each set of data.
The following code:
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
library("ggplot2")
print(
ggplot(data = df, aes(df$thrd_cnt, y=df$runtime, color=)) +
geom_line(aes(y=df$runtime4, color = "4 cores")) +
geom_point(aes(y=df$runtime4, color = "4 cores"), fill = "white",
size = 3, shape = 21) +
geom_line(aes(y=df$runtime8, color = "8 cores")) +
geom_point(aes(y=df$runtime8, color = "8 cores"), fill = "white",
size = 3, shape = 23) +
xlab("Number of Threads") +
ylab(substitute(paste("Execution Time, ", italic(milisec)))) +
scale_x_continuous(breaks=c(1,2,4,8,16)) +
theme(legend.position = c(0.3, 0.8)) +
labs(color="# cores")
)
## save a pdf and a png
ggsave("runtime.pdf", width=5, height=3.5)
ggsave("runtime.png", width=5, height=3.5)
outputs this graph:
But the data point symbols in the legend are not distinguishable. The legend shows the same symbol for both graphs (which is formed of both data point symbols on top of each other).
One possible solution is to define the number of threads as a factor, then I might be able to get the data point symbols on the legend right, but still I don't know how to do that.
Any help would be appreciated.
As noted, you need to gather the data into a long format so you can map the cores variable to colour and shape. To keep the same choices of shape and fill as in your original plot, use scale_shape_manual to set the shape corresponding to each level of cores. Note that you need to set the name for both the colour and shape legends in labs() to ensure they coincide and don't produce two legends. I also used mutate so that the levels of cores don't confusingly include the word runtime.
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
library(tidyverse)
ggplot(
data = df %>%
gather(cores, runtime, runtime4, runtime8) %>%
mutate(cores = str_c(str_extract(cores, "\\d"), " cores")),
mapping = aes(x = thrd_cnt, y = runtime, colour = cores)
) +
geom_line() +
geom_point(aes(shape = cores), size = 3, fill = "white") +
scale_x_continuous(breaks = c(1, 2, 4, 8, 16)) +
scale_shape_manual(values = c("4 cores" = 21, "8 cores" = 23)) +
theme(legend.position = c(0.3, 0.8)) +
labs(
x = "Number of Threads",
y = "Execution Time (millisec)",
colour = "# cores",
shape = "# cores"
)
Created on 2018-04-10 by the reprex package (v0.2.0).
Mapping shape works too, and if you're doing more with df, it might make sense to convert it and keep it in long, 'tidy' format.
library(ggplot2)
library(tidyr)  # provides gather() and re-exports %>%
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
df <- df %>% gather("runtime", "millisec", 2:3)
ggplot(data = df, aes(x = thrd_cnt, y = millisec, color = runtime, shape = runtime)) +
geom_line() +
geom_point()
After gathering into a "long" formatted data frame, you pass colour and shape (pch) to the aesthetics arguments:
library(tidyverse)
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
df %>% gather(key=run, value=time, -thrd_cnt) %>%
ggplot(aes(thrd_cnt, time, pch=run, colour=run)) + geom_line() + geom_point()
(Notice how brief the code is, compared to the original post)
I am trying to create a picture that summarises my data. The data describe the prevalence of drug use, obtained from different practices in different countries. Each practice has contributed a different amount of data, and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one, where the bars are all on top of each other, while I want them dodged.
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = "none") +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
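To see what the pos calculation in the mutate() step does, here is the same formula on three hypothetical bar widths. Each bar's centre is the midpoint between its left edge (the cumulative width of the bars before it) and its right edge (the cumulative width including it). The vector `prop` below is made up for illustration.

```r
# Three bars with hypothetical widths that sum to 1
prop <- c(0.2, 0.5, 0.3)

right <- cumsum(prop)                        # right edge of each bar
left  <- cumsum(c(0, prop[-length(prop)]))   # left edge of each bar
pos   <- 0.5 * (left + right)                # bar centre = midpoint of its edges

data.frame(prop, left, right, pos)
```

Because each left edge equals the previous bar's right edge, bars drawn at pos with width prop touch without overlapping.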
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour) + guides(fill = "none")
The bars are too narrow in the c1 country and, as you indicated, the one clinic is quite influential.
Also, you can specify your aesthetics once in ggplot(aes(...)) without resetting them in each layer, and there is no need to include the data frame's name inside aes() within the ggplot call.
I am trying to connect jittered points between measurements from two different methods (measure) on an x-axis. These measurements are linked to one another by the probands (a), who can be separated into two main groups, patients (pat) and controls (ctr).
My data frame looks like this:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(factor(measure)) - .25,   # factor(): strings are not auto-converted in R >= 4.0
as.numeric(factor(measure)) + .25 ) )
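For reference, this is what the conversion does on a tiny example. Since R 4.0.0, data.frame() keeps strings as character vectors, so a column has to be a factor before as.numeric() returns its level codes. The toy vectors below are made up for illustration.

```r
measure <- c("a", "b", "a", "b")
group   <- c("pat", "pat", "ctr", "ctr")

# Level codes: "a" -> 1, "b" -> 2 (levels are sorted alphabetically)
x_code <- as.numeric(factor(measure))

# Shift controls to the left and patients to the right of the integer position
dmeasure <- ifelse(group == "ctr", x_code - 0.25, x_code + 0.25)
dmeasure
```

The shifted values stay within half a unit of the integer positions, so each point remains visually attached to its category on the discrete axis.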
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
If you also want jittering, that can also be added manually to both your x and y variables.
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(factor(measure)) - .25, .1),
jitter(as.numeric(factor(measure)) + .25, .1) ),
jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by only showing one point per subject, but showing both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is also sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))
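The link between the difference plot and the paired t-test mentioned above can be checked numerically. This is a minimal sketch in base R, assuming the df from the question; reshape() is used here as a dependency-free stand-in for pivot_wider(), and the object names are illustrative.

```r
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
                 value = sample(1:10, 40, replace = TRUE),
                 measure = rep(c("a", "b"), 20),
                 group = rep(c("pat", "ctr"), each = 2, 10))

# One row per subject, one column per measurement
wide <- reshape(df, idvar = c("a", "group"), timevar = "measure",
                direction = "wide")
d <- wide$value.a - wide$value.b

# A one-sample test of the differences against 0 ...
t_diff <- t.test(d, mu = 0)
# ... is the same test as a paired t-test on the two columns
t_pair <- t.test(wide$value.a, wide$value.b, paired = TRUE)

all.equal(unname(t_diff$statistic), unname(t_pair$statistic))
```

The vector d holds the same per-subject differences that the box plot above displays.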