I’m totally new to ggplot, relatively fresh with R and want to make a smashing ”before-and-after” scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I’ve tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and X=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using genom_point
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of ”feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then not being able to subset using colors and shapes due to overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to clarify the overlapping observations that arise from using genom_point() as I do in the example below, however I have yet to understand how to combine the before and after measures into the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # GridExtra package
Or maybe it is better to summarize data by x, id, class as mean/median of y, filter out ids producing NAs (e.g. ids 3 and 6), and connect the points by lines? So in case if you don't really need to show variability for some ids (which could be true if the plot only illustrates tendencies) you can do it this way:
library(ggplot)
library(dplyr)
#library(ggthemes)
df <- df %>%
group_by(x, id, class) %>%
summarize(y = median(y, na.rm = T)) %>%
ungroup() %>%
mutate(
id = factor(id),
x = factor(x, labels = c("before", "after")),
class = factor(class, labels = c("one day", "multiple days")),
) %>%
group_by(id) %>%
mutate(nas = any(is.na(y))) %>%
ungroup() %>%
filter(!nas) %>%
select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
geom_point(aes(shape = class)) +
geom_line(show.legend = F) +
#theme_few() +
#theme(legend.position = "none") +
ylab("Feelings of peace, %") +
xlab("")
Here's one possible solution for you.
First - to get the color and shapes determined by variables, you need to put these into the aes function. I turned several into factors, so the labs function fixes the labels so they don't appear as "factor(x)" but just "x".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)
I am trying to create a picture that summarises my data. Data is about prevalence of drug use obtained from different practices form different countries. Each practice has contributed with a different amount of data and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one where bars are all on top of each other while I want them "dodge".
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = F) +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The width is too small in the C1 country and as you indicated the one clinic is quite influential.
Also, you can specify your aesthetics with the ggplot(aes(...)) and not have to reset it and it is not needed to include the dataframe objects name in the aes function within the ggplot call.
Trying to plot a stacked histogram using ggplot:
set.seed(1)
my.df <- data.frame(param = runif(10000,0,1),
x = runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param, breaks = 5)
require(ggplot2)
not logging the y-axis:
ggplot(my.df,aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey()
gives:
But I want to log10+1 transform the y-axis to make it easier to read:
ggplot(my.df, aes_string(x = "x", y = "..count..+1", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
which gives:
The tick marks on the y-axis don't make sense.
I get the same behavior if I log10 transform rather than log10+1:
ggplot(my.df, aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
Any idea what is going on?
It looks like invoking scale_y_log10 with a stacked histogram is causing ggplot to plot the product of the counts for each component of the stack within each x bin. Below is a demonstration. We create a data frame called product.of.counts that contains the product, within each x bin of the counts for each param.range bin. We use geom_text to add those values to the plot and see that they coincide with the top of each stack of histogram bars.
At first I thought this was a bug, but after a bit of searching, I was reminded of the way ggplot does the log transformation. As described in the linked answer, "scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense."
As a simpler example, say each of five components of a stacked bar have a count of 100. Then log10(100) = 2 for all five and the sum of the logs will be 10. Then ggplot takes the anti-log for the scale, which gives 10^10 for the total height of the bar (which is 100^5), even though the actual height is 100x5=500. This is exactly what's happening with your plot.
library(dplyr)
library(ggplot2)
# Data
set.seed(1)
my.df <- data.frame(param=runif(10000,0,1),x=runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param,breaks=5)
# Calculate product of counts within each x bin
product.of.counts = my.df %>%
group_by(param.range, breaks=cut(x, breaks=seq(-0.05, 1.05, 0.1), labels=seq(0,1,0.1))) %>%
tally %>%
group_by(breaks) %>%
summarise(prod = prod(n),
param.range=NA) %>%
ungroup %>%
mutate(breaks = as.numeric(as.character(breaks)))
ggplot(my.df, aes(x, fill=param.range)) +
geom_histogram(binwidth = 0.1, colour="grey30") +
scale_fill_grey() +
scale_y_log10(breaks=10^(0:14)) +
geom_text(data=product.of.counts, size=3.5,
aes(x=breaks, y=prod, label=format(prod, scientific=TRUE, digits=3)))