Overlay violin plots in R

I am trying to plot overlaying violin plots by condition within the same variable.
Var <- rnorm(100,50)
Cond <- rbinom(100, 1, 0.5)
df2 <- data.frame(Var,Cond)
ggplot(df2)+
aes(x=factor(Cond),y=Var, colour = Cond)+
geom_violin(alpha=0.3,position="identity")+
coord_flip()
So, where do I specify that I want them to overlap? Preferably, I want them to become lighter where they overlap and darker where they don't, so that their differences are clear. Any clues?

If you don't want them to have different (flipped) x-values, set x to a constant instead of x = factor(Cond). And if you want them filled in, set a fill aesthetic.
# Cond must be discrete (a factor) so that each condition gets its own violin
ggplot(df2)+
aes(x=0,y=Var, colour = factor(Cond), fill = factor(Cond))+
geom_violin(alpha=0.3,position="identity")+
coord_flip()
coord_flip isn't often needed anymore: since ggplot2 version 3.3.0 (released in early 2020), all geoms can point in either direction. I'd recommend simplifying as below for a similar result.
df2$Cond = factor(df2$Cond)
ggplot(df2) +
aes(y = 0, x = Var, colour = Cond, fill = Cond) +
geom_violin(alpha = 0.3, position = "identity")

Related

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.

I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
library(tidyverse) # provides %>%, group_by(), summarise(), as_factor() and ggplot()
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:

@Maninder: there are a few things you need to look at.
First of all: ggplot's grammar of graphics works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
Your code is not working because you mix up the layer calls and do not clearly specify which data and aesthetics belong to the scatter layer and which to the line layer.
(I) Use ggplot() + geom_point() for a scatter plot
The very first layer is ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting the Antisocial data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter, i.e. geom_point(), layer.
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I'll skip showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes(), I recommend keeping the data= and aes() calls inside each geom_xxxx(). This avoids confusion.
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sharing the same position (overplotting).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully understand what you mean by that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and changing its linetype. You can play around with the colours, etc. to get the aesthetics you are aiming for.
Please note that for the grouped boxplot you also have to treat your integer variable YOI as categorical, so I coerced it into a factor. Boxplots use fill for the body (colour sets only the outline). In this setup, you also need to supply a group value to geom_line() (I just assigned it to 1, but that is arbitrary; in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
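If instead you literally want one mean line per Race, a minimal sketch could look like the following. It assumes that "grouped by Race" means averaging ASB within every YOI and Race combination; Antisocial_demo and Table_2 are made-up names, and the data are random stand-ins for Antisocial.csv.

```r
library(ggplot2)

# Hypothetical stand-in for Antisocial.csv (values are random, for illustration only)
set.seed(42)
Antisocial_demo <- data.frame(
  YOI  = rep(c(90, 92, 94), each = 20),
  Race = rep(c(0, 1), times = 30),
  ASB  = round(runif(60, min = 0, max = 6))
)

# Mean ASB for every YOI x Race combination (base R, no plyr/dplyr needed)
Table_2 <- aggregate(ASB ~ YOI + Race, data = Antisocial_demo, FUN = mean)
names(Table_2)[names(Table_2) == "ASB"] <- "ASB_mean"

# One line and one set of points per Race; the factor makes the colour discrete
p_means <- ggplot(Table_2, aes(x = YOI, y = ASB_mean, colour = as.factor(Race))) +
  geom_line() +
  geom_point(size = 2) +
  labs(colour = "Race")
p_means
```

Because colour is mapped to a factor, ggplot groups the rows by Race automatically, so geom_line() connects the three YOI means within each Race.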
Hope this gets you going!

how to add legends from stat_summary and remove legends from the main plot?

I want to plot the values of df1 by two groups i.e. product and start_date and also plot a crossbar with the mean of df1(blue) and mean of df2(red) as in the attached diagram.
df1 <- data.frame(product = c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","C","C","C","C","C","C","C","D","D","D","D","D","D","D"),
start_date =as.Date(c('2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06', '2020-02-07')),
value = c(15.71,17.37,19.93,14.28,15.85,10.5,8.58,5.62,5.19,5.44,4.6,7.04,6.29,3.3,20.35,27.92,23.07,12.83,22.28,21.32,31.46,34.82,23.68,29.11,14.48,25.2,16.91,27.79))
df2 <- data.frame(product = c("A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C","C","D","D","D","D","D","D"),
start_date =as.Date(c('2019-07-09', '2019-07-10', '2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14')),
value = c(9.06,10.74,14.64,7.67,8.72,11.21,4.76,4.53,3.81,4.32,3.95,5.2,20.36,21.17,19.51,16.25,17.93,16.94,14.51,14.65,23.28,10.84,16.71,12.48))
PLOT GRAPH
graph1 <- ggplot(df1, aes(
y = value, x = product, fill = product, color = factor(start_date))) +
geom_col(data = df1, stat = "identity",position = position_dodge(width = 0.8), width = 0.7, inherit.aes = TRUE, size = 0) +
xlab("Product") + ylab("Values") + ylim(c(0,40)) +
scale_fill_manual(values=c("#008FCC", "#FFAA00", "#E60076", "#B00000")) +
stat_summary(data = df1, aes(x = factor(product),y = value),fun = "mean",geom = "crossbar", color = "blue", size = 1, width = 0.8, inherit.aes = FALSE) +
stat_summary(data = df2, aes(x = factor(product),y = value),fun = "mean",geom = "crossbar", color = "red", size = 1, width = 0.8, inherit.aes = FALSE)
Is there any way to remove the borders of the bar plots and add legend of the two crossbars at the top right corner of the plot ?
Additionally I would like to know if is there a way to add the just the "date" from df1 below each bar in the plot ?

Your question about adjusting the plot has multiple parts. To summarize a few points:
Change color=factor(start_date) to group=factor(start_date) to remove the colored outline around the bars while keeping the individual bars separated by start_date.
Use theme(legend.position=...) to place the legend precisely within the plot area. Use theme(legend.direction='horizontal') too when appropriate.
Add a color= aesthetic inside aes() in both stat_summary(geom='crossbar', ...) calls so that both appear in a legend, then use scale_color_manual to set the colors if you don't like the defaults.
Minor suggestion: Use ylim(X,Y) instead of ylim(c(X,Y)). ylim accepts the two limits as separate arguments, so wrapping them in a vector is unnecessary. Both forms work, which is why this point is minor.
You don't need data=df1 in the first stat_summary call, since layers default to the data set in ggplot(...). You still need the y= mapping though, since it is required.
Here's the adjusted code from implementing the notes above:
ggplot(df1, aes(y = value, x = product, fill = product, group = factor(start_date))) +
geom_col(data = df1, position = position_dodge(width = 0.8),
width = 0.7, inherit.aes = TRUE, size = 0) +
xlab("Product") + ylab("Values") + ylim(0,60) +
scale_fill_manual(values=c("#008FCC", "#FFAA00", "#E60076", "#B00000")) +
stat_summary(aes(x = factor(product), y=value, color='mean1'),
fun = "mean", geom = "crossbar",
size = 1, width = 0.8, inherit.aes = FALSE) +
stat_summary(data = df2, aes(x = factor(product),y=value, color='mean2'),
fun = "mean", geom = "crossbar",
size = 1, width = 0.8, inherit.aes = FALSE) +
theme(legend.position=c(0.75,0.8), legend.direction = 'horizontal') +
scale_color_manual(values=c('blue', 'red'))
Explanation: The point of changing to group=factor(start_date) is to keep the bars for the different start dates side by side within each product, a concept known as "dodging". Because your original color= mapping was inside aes(), it created a legend entry, and geom_col also used it for dodging, since x and y were already mapped and fill= was applied per product. If you remove color=, you get one bar for each product. Even if you specify position='dodge', geom_col will not dodge them, because there is no information about how to split them. That's why you include the group= aesthetic: it tells geom_col how to dodge.
You use aes() to tell ggplot which legends to create. If an aesthetic is mapped to x or y, it is just used for positioning. The group= aesthetic is used for dodging and other grouping behaviour, but basically any other aesthetic (size, shape, color, fill, linetype, etc.) creates a legend. If both stat_summary calls include a color aesthetic, a single combined legend is created. The problem here is that there is no column to map to color (because you have two separate datasets), so we create one by supplying a constant character value ("mean1" and "mean2") inside aes().
Final point: It might be easier to plot this if you combine your datasets. You may still want to indicate where they came from, so something like this works:
df1$origin_df <- 'df1'
df2$origin_df <- 'df2'
df <- rbind(df1, df2)
Then plot with df and not df1. You can then use one stat_summary call where you specify color=origin_df.
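A minimal sketch of that combined approach is below. The two small data frames are made-up stand-ins for your df1 and df2, and restricting the bars to the df1 rows is my assumption about what you want to show; adjust as needed.

```r
library(ggplot2)

# Small made-up stand-ins for the two data sets
df1 <- data.frame(
  product    = rep(c("A", "B"), each = 3),
  start_date = rep(as.Date(c("2020-02-01", "2020-02-02", "2020-02-03")), times = 2),
  value      = c(15, 17, 20, 5, 6, 4)
)
df2 <- data.frame(
  product    = rep(c("A", "B"), each = 3),
  start_date = rep(as.Date(c("2019-07-09", "2019-07-10", "2019-07-11")), times = 2),
  value      = c(9, 11, 14, 4, 5, 3)
)

# Tag each row with its origin, then stack the two data sets
df1$origin_df <- "df1"
df2$origin_df <- "df2"
df <- rbind(df1, df2)

p_combined <- ggplot(df, aes(x = product, y = value)) +
  # bars only for the df1 rows, dodged by start_date
  geom_col(data = df[df$origin_df == "df1", ],
           aes(fill = product, group = factor(start_date)),
           position = position_dodge(width = 0.8), width = 0.7) +
  # a single stat_summary call: one mean crossbar per product and origin
  stat_summary(aes(color = origin_df), fun = "mean", geom = "crossbar",
               width = 0.8) +
  scale_color_manual(values = c(df1 = "blue", df2 = "red"))
p_combined
```

Naming the entries in values= ties each color to its origin explicitly, rather than relying on the order of the levels.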

Gradient fill columns using ggplot2 doesn't seem to work

I would like to create a gradient within each bar of this graph going from red (low) to green (high).
At present I am using specific colours within geom_col but want to have each individual bar scale from red to green depending on the value it depicts.
Here is a simplified version of my graph (there is also a geom_line on the graph (among several other things such as titles, axis adjustments, etc.), but it isn't relevant to this question so I have excluded it from the example):
I have removed the hard-coded colours from the columns and tried using things such as scale_fill_gradient (and numerous other similar functions) to apply a gradient to said columns, but nothing seems to work.
Here is what the output is when I use scale_fill_gradient(low = "red", high = "green"):
What I want is for each bar to have its own gradient and not for each bar to represent a step in said gradient.
How can I achieve this using ggplot2?
My code for the above (green) example:
ggplot(data = all_chats_hn,
aes(x = as.Date(date))) +
geom_col(aes(y = total_chats),
colour = "black",
fill = "forestgreen")

I'm not sure if that is possible with geom_col. It is possible by using geom_line and a little data augmentation. We have to use the y value to create a sequence of y values (y_seq), so that the gradient coloring works. We also create y_seq_scaled in case you want each line to have an "independent" gradient.
library(tidyverse)
set.seed(123) # reproducibility
dat <- tibble(x = 1:10, y = abs(rnorm(10))) %>% # tibble() replaces the deprecated data_frame()
group_by(x) %>%
mutate(y_seq = list(seq(0, y, length.out = 100))) %>% # create sequence
unnest(y_seq) %>%
mutate(y_seq_scaled = (y_seq - mean(y_seq)) / sd(y_seq)) # scale sequence
# gradient for all together
ggplot(dat, aes(x = factor(x), y = y_seq, colour = y_seq))+
geom_line(size = 2)+
scale_colour_gradient(low = 'red', high = 'green')
# independent gradients per x
ggplot(dat, aes(x = factor(x), y = y_seq, colour = y_seq_scaled))+
geom_line(size = 2)+
scale_colour_gradient(low = 'red', high = 'green')

Can I fix overlapping dashed lines in a histogram in ggplot2?

I am trying to plot a histogram of two overlapping distributions in ggplot2. Unfortunately, the graphic needs to be in black and white. I tried representing the two categories with different shades of grey, with transparency, but the result is not as clear as I would like. I tried adding outlines to the bars with different linetypes, but this produced some strange results.
require(ggplot2)
set.seed(65)
a = rnorm(100, mean = 1, sd = 1)
b = rnorm(100, mean = 3, sd = 1)
dat <- data.frame(category = rep(c('A', 'B'), each = 100),
values = c(a, b))
ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 1) +
scale_fill_grey()
Notice that one of the lines that should appear dotted is in fact solid (at a value of x = 4). I think this must be a result of it actually being two lines - one from the 3-4 bar and one from the 4-5 bar. The dots are out of phase so they produce a solid line. The effect is rather ugly and inconsistent.
Is there any way of fixing this overlap?
Can anyone suggest a more effective way of clarifying the difference between the two categories, without resorting to colour?
Many thanks.

One possibility would be to use a 'hollow histogram', as described here:
# assign your original plot object to a variable
p1 <- ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
# p1
# extract relevant variables from the plot object to a new data frame
# your grouping variable 'category' is named 'group' in the plot object
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
# plot using geom_step
ggplot(data = df, aes(x = xmin, y = y, linetype = factor(group))) +
geom_step()
If you want to vary both linetype and fill, you need to plot a histogram first (which can be filled). Set the outline colour of the histogram to transparent. Then add the geom_step. Use theme_bw to avoid 'grey elements on grey background'.
p1 <- ggplot() +
geom_histogram(data = dat, aes(x = values, fill = category),
colour = "transparent", position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
df$category <- factor(df$group, labels = c("A", "B"))
p1 +
geom_step(data = df, aes(x = xmin, y = y, linetype = category)) +
theme_bw()
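As a possible shortcut, geom_step can do the binning itself via stat = "bin", which avoids the ggplot_build step. This is a sketch rather than a drop-in replacement: it assumes ggplot2 3.3.0 or later for direction = "mid", and the object name p_step is made up.

```r
library(ggplot2)

set.seed(65)
dat <- data.frame(
  category = rep(c("A", "B"), each = 100),
  values   = c(rnorm(100, mean = 1, sd = 1), rnorm(100, mean = 3, sd = 1))
)

# stat = "bin" bins the values inside the step layer;
# direction = "mid" centres each step on its bin midpoint
p_step <- ggplot(dat, aes(x = values, linetype = category)) +
  geom_step(stat = "bin", binwidth = 0.4, position = "identity",
            direction = "mid") +
  theme_bw()
p_step
```

Each category gets a single outline here, so there are no shared bar edges whose dashes could fall out of phase.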

First, I would recommend theme_set(theme_bw()) or theme_set(theme_classic()) (this sets the background to white, which makes it (much) easier to see shades of gray).
Second, you could try something like scale_linetype_manual(values=c(1,3)) -- this won't completely eliminate the artifacts you're unhappy about, but it might make them a little less prominent since linetype 3 is sparser than linetype 2.
Short of drawing density plots instead (which won't work very well for small samples and may not be familiar to your audience), dodging the positions of the histograms (which is ugly), or otherwise departing from histogram conventions, I can't think of a better solution.

Adding points to GGPLOT2 Histogram

I'm trying to produce a histogram that illustrates observed points(a sub-set) on a histogram of all observations. To make it meaningful, I need to color each point differently and place a legend on the plot. My problem is, I can't seem to get a scale to show up on the plot. Below is an example of what I've tried.
subset <-1:8
results = data.frame(x_data = rnorm(5000),TestID=1:5000)
m <- ggplot(results,aes(x=x_data))
m+stat_bin(aes(y=..density..))+
stat_density(colour="blue", fill=NA)+
geom_point(data = results[results$TestID %in% subset,],
aes(x = x_data, y = 0),
colour = as.factor(results$TestID[results$TestID %in% subset]),
size = 5)+
scale_colour_brewer(type="seq", palette=3)
Ideally, I'd like the points to be positioned on the density line(but I'm really unsure of how to make that work, so I'll settle to position them at y = 0). What I need most urgently is a legend which indicates the TestID that corresponds to each of the points in subset.
Thanks a lot to anyone who can help.

This addresses your second point: if you want a legend, you need to map the variable to an aesthetic (colour in this case) inside aes(). So all you really need to do is move colour = as.factor(results$TestID[results$TestID %in% subset]) inside the call to aes() like so:
ggplot(results,aes(x=x_data)) +
stat_bin(aes(y=..density..))+
stat_density(colour="blue", fill=NA)+
geom_point(data = results[results$TestID %in% subset,],
aes(x = x_data,
y = 0,
colour = as.factor(results$TestID[results$TestID %in% subset])
),
size = 5) +
scale_colour_brewer("Fancy title", type="seq", palette=3)
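For your first point (placing the points on the density line), one possible approach is to compute the density yourself with density() and interpolate it at the selected x values with approx(). This is only a sketch: the names subset_ids, dens and pts are made up, and since the curve is drawn from the same density() estimate, the points sit exactly on the line drawn here rather than on a separately computed stat_density curve.

```r
library(ggplot2)

set.seed(1)
results    <- data.frame(x_data = rnorm(5000), TestID = 1:5000)
subset_ids <- 1:8

# Kernel density estimate of all observations (default bandwidth)
dens <- density(results$x_data)

# Selected points, with the density height interpolated at each x value
pts <- results[results$TestID %in% subset_ids, ]
pts$dens_y <- approx(dens$x, dens$y, xout = pts$x_data)$y
pts$TestID <- factor(pts$TestID)

p_dens <- ggplot(results, aes(x = x_data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  # draw the curve from the same estimate the points were interpolated on
  geom_line(data = data.frame(x = dens$x, y = dens$y),
            aes(x = x, y = y), colour = "blue") +
  geom_point(data = pts, aes(y = dens_y, colour = TestID), size = 5) +
  scale_colour_brewer("TestID", palette = "Set1")
p_dens
```

after_stat(density) is the current spelling of ..density.. and needs ggplot2 3.3.0 or later.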