Can I fix overlapping dashed lines in a histogram in ggplot2?

I am trying to plot a histogram of two overlapping distributions in ggplot2. Unfortunately, the graphic needs to be in black and white. I tried representing the two categories with different shades of grey, with transparency, but the result is not as clear as I would like. I tried adding outlines to the bars with different linetypes, but this produced some strange results.
require(ggplot2)
set.seed(65)
a = rnorm(100, mean = 1, sd = 1)
b = rnorm(100, mean = 3, sd = 1)
dat <- data.frame(category = rep(c('A', 'B'), each = 100),
values = c(a, b))
ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 1) +
scale_fill_grey()
Notice that one of the lines that should appear dotted is in fact solid (at a value of x = 4). I think this must be a result of it actually being two lines - one from the 3-4 bar and one from the 4-5 bar. The dots are out of phase so they produce a solid line. The effect is rather ugly and inconsistent.
Is there any way of fixing this overlap?
Can anyone suggest a more effective way of clarifying the difference between the two categories, without resorting to colour?
Many thanks.

One possibility would be to use a 'hollow histogram', i.e. to draw only the outline of each distribution as a step line:
# assign your original plot object to a variable
p1 <- ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
# p1
# extract relevant variables from the plot object to a new data frame
# your grouping variable 'category' is named 'group' in the plot object
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
# plot using geom_step
ggplot(data = df, aes(x = xmin, y = y, linetype = factor(group))) +
geom_step()
If you want to vary both linetype and fill, plot a filled histogram first with its outline colour set to transparent, then add the geom_step on top. Use theme_bw() to avoid grey elements on a grey background.
p1 <- ggplot() +
geom_histogram(data = dat, aes(x = values, fill = category),
colour = "transparent", position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
df$category <- factor(df$group, labels = c("A", "B"))
p1 +
geom_step(data = df, aes(x = xmin, y = y, linetype = category)) +
theme_bw()
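A variation on the same idea, offered as a sketch rather than a definitive recipe: geom_step() can do the binning itself via stat = "bin", so the counts do not have to be pulled out of ggplot_build(). This assumes ggplot2 3.3.0 or later (for direction = "mid"); the data are regenerated as in the question, and the name p_hollow is only illustrative. Note that the outline starts and ends at the centres of the outermost bins.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(65)
dat <- data.frame(category = rep(c("A", "B"), each = 100),
                  values = c(rnorm(100, mean = 1, sd = 1),
                             rnorm(100, mean = 3, sd = 1)))

p_hollow <- ggplot(dat, aes(x = values)) +
  # filled bars without any outline
  geom_histogram(aes(fill = category), colour = NA,
                 position = "identity", alpha = 0.4, binwidth = 0.4) +
  # outline drawn from the same bins; "mid" puts the steps on the bin edges
  geom_step(aes(linetype = category), stat = "bin", binwidth = 0.4,
            position = "identity", direction = "mid") +
  scale_fill_grey() +
  theme_bw()
p_hollow
```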

First, I would recommend theme_set(theme_bw()) or theme_set(theme_classic()) (this sets the background to white, which makes it (much) easier to see shades of gray).
Second, you could try something like scale_linetype_manual(values=c(1,3)) -- this won't completely eliminate the artifacts you're unhappy about, but it might make them a little less prominent since linetype 3 is sparser than linetype 2.
Short of drawing density plots instead (which won't work very well for small samples and may not be familiar to your audience), dodging the positions of the histograms (which is ugly), or otherwise departing from histogram conventions, I can't think of a better solution.
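For reference, the two suggestions above applied to the plot from the question might look like the sketch below. The data are regenerated as in the question; p_bw is just an illustrative name, and theme_bw() is added to the single plot here instead of calling theme_set().

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(65)
dat <- data.frame(category = rep(c("A", "B"), each = 100),
                  values = c(rnorm(100, mean = 1, sd = 1),
                             rnorm(100, mean = 3, sd = 1)))

p_bw <- ggplot(dat, aes(x = values, linetype = category, fill = category)) +
  geom_histogram(colour = "black", position = "identity",
                 alpha = 0.4, binwidth = 1) +
  scale_fill_grey() +
  # 1 = solid, 3 = dotted
  scale_linetype_manual(values = c(1, 3)) +
  theme_bw()
p_bw
```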

Related

Overlay violin plots in r

I am trying to plot overlaying violin plots by condition within the same variable.
Var <- rnorm(100,50)
Cond <- rbinom(100, 1, 0.5)
df2 <- data.frame(Var,Cond)
ggplot(df2)+
aes(x=factor(Cond),y=Var, colour = Cond)+
geom_violin(alpha=0.3,position="identity")+
coord_flip()
So, where do I specify that I want them to overlap? Preferably, I want the colour to become lighter where they overlap and darker where they do not, so that their differences are clear. Any clues?

If you don't want them to have different (flipped) x-values, set x to a constant instead of x = factor(Cond). And if you want them filled in, set a fill aesthetic.
ggplot(df2)+
aes(x=0,y=Var, colour = factor(Cond), fill = factor(Cond))+ # factor() so each condition gets its own violin
geom_violin(alpha=0.3,position="identity")+
coord_flip()
coord_flip isn't often needed anymore--since version 3.3.0 (released in early 2020) all geoms can point in either direction. I'd recommend simplifying as below for a similar result.
df2$Cond = factor(df2$Cond)
ggplot(df2) +
aes(y = 0, x = Var, colour = Cond, fill = Cond) +
geom_violin(alpha = 0.3, position = "identity")

ggplot2 density of one dimension in 2D plot

I would like to plot a background that captures the density of points in one dimension in a scatter plot. This would serve a similar purpose to a marginal density plot or a rug plot. I have a way of doing it that is not particularly elegant, I am wondering if there's some built-in functionality I can use to produce this kind of plot.
Mainly there are a few issues with the current approach:
Alpha overlap at boundaries causes banding at lower resolution as seen here. - Primary objective, looking for a geom or other solution that draws a nice continuous band filled with specific colour. Something like geom_density_2d() but with the stat drawn from only the X axis.
"Background" does not cover expanded area, can use coord_cartesian(expand = FALSE) but would like to cover regular margins. - Not a big deal, is nice-to-have but not required.
Setting scale_fill "consumes" the option for the plot, not allowing it to be set independently for the points themselves. - This may not be easily achievable, independent palettes for layers appears to be a fundamental issue with ggplot2.
library(tidyverse) # for tibble(), %>%, mutate() and ggplot()
data(iris)
dns <- density(iris$Sepal.Length)
dns_df <- tibble(
x = dns$x,
density = dns$y
)%>%
mutate(
start = x - mean(diff(x))/2,
end = x + mean(diff(x))/2
)
ggplot() +
geom_rect(
data = dns_df,
aes(xmin = start, xmax = end, fill = density),
ymin = min(iris$Sepal.Width),
ymax = max(iris$Sepal.Width),
alpha = 0.5) +
scale_fill_viridis_c(option = "A") +
geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_rug(data = iris, aes(x = Sepal.Length))

This is a bit of a hacky solution because it (ab)uses knowledge of how objects are internally parametrised, which will yield some warnings, but it gets you what you want.
First, we'll use a geom_raster() + stat_density() decorated with some choice after_stat()/stage() delayed evaluation. Normally, this would result in a height = 1 strip, but by setting the internal parameters ymin/ymax to infinities, we'll have the strip extend the whole height of the plot. Using geom_raster() resolves the alpha issue you were having.
library(ggplot2)
p <- ggplot(iris) +
geom_raster(
aes(Sepal.Length,
y = mean(Sepal.Width),
fill = after_stat(density),
ymin = stage(NULL, after_scale = -Inf),
ymax = stage(NULL, after_scale = Inf)),
stat = "density", alpha = 0.5
)
#> Warning: Ignoring unknown aesthetics: ymin, ymax
p
#> Warning: Duplicated aesthetics after name standardisation: NA
Next, we add a fill scale, and immediately follow that by ggnewscale::new_scale_fill(). This allows another layer to use a second fill scale, as demonstrated with fill = Species.
p <- p +
scale_fill_viridis_c(option = "A") +
ggnewscale::new_scale_fill() +
geom_point(aes(Sepal.Length, Sepal.Width, fill = Species),
shape = 21) +
geom_rug(aes(Sepal.Length))
p
#> Warning: Duplicated aesthetics after name standardisation: NA
Lastly, to get rid of the padding at the x-axis, we can manually extend the limits and then shrink in the expansion. It allows for an extended range over which the density can be estimated, making the raster fill the whole area. There is some mismatch between how ggplot2 and scales::expand_range() are parameterised, so the exact values are a bit of trial and error.
p +
scale_x_continuous(
limits = ~ scales::expand_range(.x, mul = 0.05),
expand = c(0, -0.2)
)
#> Warning: Duplicated aesthetics after name standardisation: NA
Created on 2022-07-04 by the reprex package (v2.0.1)
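To see what the limits function in that last step does on its own, here is a small sketch (assuming the scales package is installed; the variable names are illustrative). scales::expand_range() widens a range by mul times its width on each side, which is why the x limits end up slightly wider than the data.

```r
library(scales)

# Observed range of the x variable
x_range <- range(iris$Sepal.Length)

# Widen each end by 5% of the range width
expanded <- expand_range(x_range, mul = 0.05)

x_range
expanded
```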

This doesn't solve your problem (I'm not sure I understand all the issues correctly), but perhaps it will help:
Background does not cover expanded area, can use coord_cartesian(expand = FALSE) but would like to cover regular margins.
If you make the 'background' larger and use coord_cartesian() you can get the same 'filled-to-the-edges' effect; would this work for your use-case?
Alpha overlap at boundaries causes banding at lower resolution as seen here.
I wasn't able to fix the banding completely, but my approach below appears to reduce it.
Setting scale_fill "consumes" the option for the plot, not allowing it to be set independently for the points themselves.
If you use geom_segment() you can map density to colour, leaving fill available for e.g. the points. Again, not sure if this is a useable solution, just an idea that might help.
library(tidyverse)
data(iris)
dns <- density(iris$Sepal.Length)
dns_df <- tibble(
x = dns$x,
density = dns$y
) %>%
mutate(
start = x - mean(diff(x))/2,
end = x + mean(diff(x))/2
)
ggplot() +
geom_segment(
data = dns_df,
aes(x = start, xend = end,
y = min(iris$Sepal.Width) * 0.9,
yend = max(iris$Sepal.Width) * 1.1,
color = density), alpha = 0.5) +
coord_cartesian(ylim = c(min(iris$Sepal.Width),
max(iris$Sepal.Width)),
xlim = c(min(iris$Sepal.Length),
max(iris$Sepal.Length))) +
scale_color_viridis_c(option = "A", alpha = 0.5) +
scale_fill_viridis_d() +
geom_point(data = iris, aes(x = Sepal.Length,
y = Sepal.Width,
fill = Species),
shape = 21) +
geom_rug(data = iris, aes(x = Sepal.Length))
Created on 2022-07-04 by the reprex package (v2.0.1)

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.

I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
library(tidyverse) # provides %>%, group_by(), summarise(), as_factor() and ggplot()
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:

@Maninder: there are a few things you need to look at.
First of all: the grammar of graphics behind ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
Your code does not work because it mixes up the layers: it never clearly specifies which data the scatter layer should show and which data the line layer should show.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting the Antisocial data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter, i.e. geom_point(), layer.
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I'll skip showing the resulting line plot.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes(), I recommend keeping the data = and aes() calls inside each geom_xxxx(). This avoids confusion.
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sharing the same position (overplotting).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully comprehend what you mean with that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and using a dashed linetype. You can play around with the colours, etc. to get the aesthetics you are aiming for.
Please note that for the grouped boxplot you also have to convert your integer variable YOI; I coerced it into a categorical factor. Boxplots use fill for the body (colour only sets the outline). In this setup, you also need to supply a group value to geom_line() (I just set it to 1, but that is arbitrary; in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
Hope this gets you going!
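If the goal is literally one line of mean ASB per Race across YOI, a minimal sketch could look like the following. The real Antisocial.csv is not reproduced here, so Antisocial_sim is a simulated, purely hypothetical stand-in with the same column names; only the aggregation and plotting steps carry over.

```r
library(ggplot2)

# Simulated stand-in for Antisocial.csv (hypothetical values)
set.seed(1)
Antisocial_sim <- data.frame(
  Race = rep(c(0, 1), each = 150),
  YOI  = rep(c(90, 92, 94), times = 100),
  ASB  = rpois(300, lambda = 1.6)
)

# Mean ASB for every YOI x Race combination
means <- aggregate(ASB ~ YOI + Race, data = Antisocial_sim, FUN = mean)
names(means)[names(means) == "ASB"] <- "ASB_mean"

# One line and one set of points per Race
p_means <- ggplot(means, aes(x = YOI, y = ASB_mean,
                             colour = factor(Race), group = Race)) +
  geom_line() +
  geom_point(size = 2) +
  labs(colour = "Race")
p_means
```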

how to add legends from stat_summary and remove legends from the main plot?

I want to plot the values of df1 by two groups i.e. product and start_date and also plot a crossbar with the mean of df1(blue) and mean of df2(red) as in the attached diagram.
df1 <- data.frame(product = c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","C","C","C","C","C","C","C","D","D","D","D","D","D","D"),
start_date =as.Date(c('2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06', '2020-02-07')),
value = c(15.71,17.37,19.93,14.28,15.85,10.5,8.58,5.62,5.19,5.44,4.6,7.04,6.29,3.3,20.35,27.92,23.07,12.83,22.28,21.32,31.46,34.82,23.68,29.11,14.48,25.2,16.91,27.79))
df2 <- data.frame(product = c("A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C","C","D","D","D","D","D","D"),
start_date =as.Date(c('2019-07-09', '2019-07-10', '2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14')),
value = c(9.06,10.74,14.64,7.67,8.72,11.21,4.76,4.53,3.81,4.32,3.95,5.2,20.36,21.17,19.51,16.25,17.93,16.94,14.51,14.65,23.28,10.84,16.71,12.48))
PLOT GRAPH
graph1 <- ggplot(df1, aes(
y = value, x = product, fill = product, color = factor(start_date))) +
geom_col(data = df1, stat = "identity",position = position_dodge(width = 0.8), width = 0.7, inherit.aes = TRUE, size = 0) +
xlab("Product") + ylab("Values") + ylim(c(0,40)) +
scale_fill_manual(values=c("#008FCC", "#FFAA00", "#E60076", "#B00000")) +
stat_summary(data = df1, aes(x = factor(product),y = value),fun = "mean",geom = "crossbar", color = "blue", size = 1, width = 0.8, inherit.aes = FALSE) +
stat_summary(data = df2, aes(x = factor(product),y = value),fun = "mean",geom = "crossbar", color = "red", size = 1, width = 0.8, inherit.aes = FALSE)
Is there any way to remove the borders of the bars and add a legend for the two crossbars at the top right corner of the plot?
Additionally, I would like to know whether there is a way to add just the "date" from df1 below each bar in the plot.

Your question about adjusting the plot has multiple parts. To summarize a few points:
Change from color=factor(start_date) to group= to remove the color around bars, but maintain the separation of individual bars by start_date
Use theme(legend.position=... and specify precise placement of legend within plot area. Use theme(legend.direction='horizontal') too when appropriate.
Add color= attribute into the stat_summary(geom='crossbar'...) calls in order to "add" them both to a legend, then use scale_color_manual to specify color if you don't like the default.
Minor suggestion: Use ylim(X,Y) instead of ylim(c(X,Y)). It's not necessary to put the limits into a vector, since ylim also accepts them as two separate arguments, which is simpler. Both forms work, which is why this point is minor.
You don't need the data=df1 for the first stat_summary call, since it's the default mapping based on the data= value set in ggplot(.... You still need the y= value though, since it is required.
Here's the adjusted code from implementing the notes above:
ggplot(df1, aes(y = value, x = product, fill = product, group = factor(start_date))) +
geom_col(data = df1, position = position_dodge(width = 0.8),
width = 0.7, inherit.aes = TRUE, size = 0) +
xlab("Product") + ylab("Values") + ylim(0,60) +
scale_fill_manual(values=c("#008FCC", "#FFAA00", "#E60076", "#B00000")) +
stat_summary(aes(x = factor(product), y=value, color='mean1'),
fun = "mean", geom = "crossbar",
size = 1, width = 0.8, inherit.aes = FALSE) +
stat_summary(data = df2, aes(x = factor(product),y=value, color='mean2'),
fun = "mean", geom = "crossbar",
size = 1, width = 0.8, inherit.aes = FALSE) +
theme(legend.position=c(0.75,0.8), legend.direction = 'horizontal') +
scale_color_manual(values=c('blue', 'red'))
Explanation: The point of changing to group=factor(start_date) is so that you maintain the splitting of bars among the different products--a concept known as "dodging". Since your original call to color= was in the aes(, it created a legend item and the geom_col used this for dodging, since the other aesthetics were already mapped to x and y, and the fill= aesthetic was being applied. If you remove color=, you get one bar for each product. Even if you specify position='dodge', geom_col would not dodge them because there's no information about how to do that. That's why you include the group= aesthetic--to give geom_col information on how it should be dodging.
You use aes(... to indicate to ggplot which legends to create. If the aesthetic is mapped to x or y, it just uses that for plotting. group= aesthetics are used for dodging and other group attributes, but basically any other aesthetics (size, shape, color, fill, linetype... etc etc) are used to create legends. If we specify both stat_summary calls to include a color aesthetic, a legend will be created that is combined. The problem here is that there is no single column to map to color (because the two means come from two different datasets), so we create the mapping by passing a constant character string ("mean1" and "mean2") as color inside aes().
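The dodging point can be checked on a tiny made-up data frame (toy, with hypothetical values, is not the asker's data): without a grouping aesthetic the two bars of each product share one x position, while group = day lets position_dodge() separate them.

```r
library(ggplot2)

# Two products, two days each (made-up values)
toy <- data.frame(product = rep(c("A", "B"), each = 2),
                  day     = rep(c("d1", "d2"), times = 2),
                  value   = c(3, 5, 2, 4))

# No grouping information: nothing to dodge by
p_plain <- ggplot(toy, aes(x = product, y = value)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7)

# group = day tells position_dodge() how to split the bars
p_dodged <- ggplot(toy, aes(x = product, y = value, group = day)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7)

# Number of distinct bar positions on the x axis in each plot
n_plain  <- length(unique(ggplot_build(p_plain)$data[[1]]$x))
n_dodged <- length(unique(ggplot_build(p_dodged)$data[[1]]$x))
c(plain = n_plain, dodged = n_dodged)
```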
Final point: It might be easier to plot this if you combine your datasets. You may still want to indicate where they came from, so something like this works:
df1$origin_df <- 'df1'
df2$origin_df <- 'df2'
df <- rbind(df1, df2)
Then plot with df and not df1. You can then use one stat_summary call where you specify color=origin_df.
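A minimal sketch of that combined-data approach, using small made-up data frames in place of the full df1 and df2 from the question (the values and the name p_means are illustrative only):

```r
library(ggplot2)

# Small stand-ins for df1 and df2 (made-up values)
df1 <- data.frame(product = rep(c("A", "B"), each = 3),
                  value   = c(15, 17, 19, 5, 6, 7))
df2 <- data.frame(product = rep(c("A", "B"), each = 3),
                  value   = c(9, 10, 14, 4, 5, 3))

# Tag each row with its source, then stack the two data frames
df1$origin_df <- "df1"
df2$origin_df <- "df2"
df <- rbind(df1, df2)

# A single stat_summary() call draws one mean crossbar per product and source
p_means <- ggplot(df, aes(x = product, y = value)) +
  stat_summary(aes(colour = origin_df), fun = "mean",
               geom = "crossbar", width = 0.8) +
  scale_colour_manual(values = c(df1 = "blue", df2 = "red"))
p_means
```

Because origin_df is a real column here, the colour legend is created from the data and no constant strings are needed inside aes().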

Gradient fill columns using ggplot2 doesn't seem to work

I would like to create a gradient within each bar of this graph going from red (low) to green (high).
At present I am using specific colours within geom_col but want to have each individual bar scale from red to green depending on the value it depicts.
Here is a simplified version of my graph (there is also a geom_line on the graph (among several other things such as titles, axis adjustments, etc.), but it isn't relevant to this question so I have excluded it from the example):
I have removed the hard-coded colours from the columns and tried using things such as scale_fill_gradient (and numerous other similar functions) to apply a gradient to said columns, but nothing seems to work.
Here is what the output is when I use scale_fill_gradient(low = "red", high = "green"):
What I want is for each bar to have its own gradient and not for each bar to represent a step in said gradient.
How can I achieve this using ggplot2?
My code for the above (green) example:
ggplot(data = all_chats_hn,
aes(x = as.Date(date))) +
geom_col(aes(y = total_chats),
colour = "black",
fill = "forestgreen")

I'm not sure if that is possible with geom_col. It is possible by using geom_line and a little data augmentation. We have to use the y value to create a sequence of y values (y_seq), so that the gradient coloring works. We also create y_seq_scaled in case you want each line to have an "independent" gradient.
library(tidyverse)
set.seed(123) # reproducibility
dat <- tibble(x = 1:10, y = abs(rnorm(10))) %>% # tibble() replaces the deprecated data_frame()
group_by(x) %>%
mutate(y_seq = list(seq(0, y, length.out = 100))) %>% # create sequence
unnest(y_seq) %>%
mutate(y_seq_scaled = (y_seq - mean(y_seq)) / sd(y_seq)) # scale sequence
# gradient for all together
ggplot(dat, aes(x = factor(x), y = y_seq, colour = y_seq))+
geom_line(size = 2)+
scale_colour_gradient(low = 'red', high = 'green')
# independent gradients per x
ggplot(dat, aes(x = factor(x), y = y_seq, colour = y_seq_scaled))+
geom_line(size = 2)+
scale_colour_gradient(low = 'red', high = 'green')
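As a possible variation on the independent-gradient version, each bar's sequence can be expressed as a fraction from 0 to 1 instead of being standardised, so every bar runs from pure red at its base to pure green at its top. This is only a sketch in base R plus ggplot2; bars, slices and frac are illustrative names, and linewidth assumes ggplot2 3.4.0 or later (use size on older versions).

```r
library(ggplot2)

set.seed(123)
bars <- data.frame(x = 1:10, y = abs(rnorm(10)))

# One row per (bar, slice); frac runs from 0 to 1 within every bar
slices <- do.call(rbind, lapply(seq_len(nrow(bars)), function(i) {
  frac <- seq(0, 1, length.out = 100)
  data.frame(x = bars$x[i], y_seq = frac * bars$y[i], frac = frac)
}))

# Colour by the within-bar fraction, so each bar spans the full gradient
p_grad <- ggplot(slices, aes(x = factor(x), y = y_seq, colour = frac)) +
  geom_line(linewidth = 2) +
  scale_colour_gradient(low = "red", high = "green")
p_grad
```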