Changing colors in histogram - r

I am currently making a histogram with three different variables x,y and z using the following code:
require(ggplot2)
require(reshape2)
set.seed(1)
df <- data.frame(x = rnorm(n = 1000, mean = 2, sd = 0.2),
y = rnorm(n = 1000, mean = 2),
z = rnorm(n = 1000, mean = 2))
ggplot(melt(df), aes(value, fill = variable)) + geom_histogram(position = "dodge")
The code works fine, but I want to change the colors of the three different histograms and I'm not really sure how to do this in this specific case. Maybe to something like red, black and green for instance.
Thanks
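One possible way to do this is to add a manual fill scale to the same plot. This is only a sketch: `scale_fill_manual()` is part of ggplot2, but the named vector `fill_cols` is a made-up helper, and which colour goes with which variable is an assumption.

```r
library(ggplot2)
library(reshape2)

set.seed(1)
df <- data.frame(x = rnorm(n = 1000, mean = 2, sd = 0.2),
                 y = rnorm(n = 1000, mean = 2),
                 z = rnorm(n = 1000, mean = 2))

# Hypothetical colour assignment; the names match the levels of `variable`
# that melt() creates from the column names of df
fill_cols <- c(x = "red", y = "black", z = "green")

p <- ggplot(melt(df), aes(value, fill = variable)) +
  geom_histogram(position = "dodge") +
  scale_fill_manual(values = fill_cols)
p
```

Naming the vector ties each colour to a variable explicitly, so the mapping does not depend on the order of the factor levels.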

Related

R ggplot: overlay two conditional density plots (same binary outcome variable) - possible?

I know how to plot several density curves/polygons on one plot, but not conditional density plots.
Reproducible example:
require(ggplot2)
# generate data
a <- runif(200, min=0, max = 1000)
b <- runif(200, min=0, max = 1000)
c <- sample(c("A", "B"), 200, replace =T)
df <- data.frame(a,b,c)
# plot 1
ggplot(df, aes(a, fill = c)) +
geom_density(position='fill', alpha = 0.5)
# plot 2
ggplot(df, aes(b, fill = c)) +
geom_density(position='fill', alpha = 0.5)
In my real data I have a bunch of these paired conditional density plots and I would need to overlay one over the other to see (and show) how different (or similar) they are. Does anyone know how to do this?
One way would be to plot the two versions as layers. The overlapping areas will be slightly different, depending on the layer order, based on how alpha works in ggplot2. This may or may not be what you want. You might fiddle with the two alphas, or vary the border colors, to distinguish them more.
ggplot(df, aes(fill = c)) +
geom_density(aes(a), position='fill', alpha = 0.5) +
geom_density(aes(b), position='fill', alpha = 0.5)
For example, you might make it so the fill only applies to one layer, but the other layer distinguishes groups using the group aesthetic, and perhaps a different linetype. This one seems more readable to me, especially if there is a natural ordering to the two variables that justifies putting one in the "foreground" and one in the "background."
ggplot(df) +
geom_density(aes(a, group = c), position='fill', alpha = 0.2, linetype = "dashed") +
geom_density(aes(b, fill = c), position='fill', alpha = 0.5)
I'm not so sure that "on top of one another" is a great idea. Jon's ideas are probably the way to go. But what about just plotting side by side? Our brains can cope with that, and we can compare the plots pretty well.
Make it long, then use facet.
Another option might be an animated graph (see 2nd code chunk below).
require(ggplot2)
#> Loading required package: ggplot2
library(tidyverse)
a <- runif(200, min=0, max = 1000)
b <- runif(200, min=0, max = 1000)
#### BAAAAAD idea to call anything "c" in R!!! Don't do this. ever!
d <- sample(c("A", "B"), 200, replace =T)
df <- data.frame(a,b,d)
df %>% pivot_longer(cols = c(a,b)) %>%
ggplot(aes(value, fill = d)) +
geom_density(position='fill', alpha = 0.5) +
facet_grid(~name)
library(gganimate)
p <- df %>% pivot_longer(cols = c(a,b)) %>%
ggplot(aes(value, fill = d)) +
geom_density(position='fill', alpha = 0.5) +
labs(title = "{closest_state}")
p_anim <- p + transition_states(name)
animate(p_anim, duration = 2, fps = 5)
Created on 2022-06-14 by the reprex package (v2.0.1)
Although it is not the overlay you might have thought of, it facilitates the comparison of density curves:
library(tidyverse)
library(ggridges)
library(truncnorm)
DF <- tibble(
alpha = rtruncnorm(n = 200, a = 0, b = 1000, mean = 500, sd = 50),
beta = rtruncnorm(n = 200, a = 0, b = 1000, mean = 550, sd = 50)
)
DF <- DF %>%
pivot_longer(c(alpha, beta), names_to = "name", values_to = "meas") %>%
mutate(name = factor(name))
DF %>%
ggplot(aes(meas, name, fill = factor(stat(quantile)))) +
stat_density_ridges(
geom = "density_ridges_gradient",
calc_ecdf = T,
quantiles = 4,
quantile_lines = T
) +
scale_fill_viridis_d(name = "Quartiles")

ggplot: transparency of histogram as function of stat(count)

I'm trying to make a scaled histogram in such a way that the transparency of each "column" (bin?) depends on the number of observations in a given range of x. Here is my code:
set.seed(1)
test = data.frame(x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=100)))
threshold = 20
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = stat(count) > threshold),
position = "fill", bins = 10)
Basically I want to make plots that look like this:
However, my code generates plots where transparency is applied based on the count after grouping, which ends up with hanging columns like this:
For this example, to simulate a "proper" plot I just adjusted the threshold, but I need alpha to consider the sum of the counts from both groups in a given "column" (bin).
UPDATE:
I also want it to work with faceted plots in such a way that the highlighted area in each facet is independent of the other facets. The approach proposed by @Stefan works perfectly for an individual plot, but in a faceted plot it highlights the same area in all facets.
library(ggplot2)
set.seed(1)
test = data.frame(x = rnorm(1000, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=1000)),
n = as.factor(sample(c(0,1,2), replace=TRUE, size=1000)),
m = as.factor(sample(c(0,1,3,4), replace=TRUE, size=1000)))
f = function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
threshold = 10
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10)+
facet_grid(rows = vars(n),
cols = vars(m))
This could be achieved like so:
As the count computed by stat_bin (the stat behind geom_histogram) is the number of observations after grouping, we have to manually aggregate the count over groups to get the total count per bin.
To aggregate the counts per bin I use tapply, where I make use of the .. notation to get the variables computed by the stat.
As the grouping variable I make use of the computed variable ..x.. which to the best of my knowledge is not documented. Basically ..x.. contains by default the midpoints of the bins and as such can be used as an identifier for the bins. However, as these are continuous values we have to convert them to a factor.
Finally, to make the code more readable I use an auxiliary function to compute the aggregate counts. Additionally I double the threshold value to 20.
library(ggplot2)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100))
)
threshold <- 20
f <- function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10
)
p
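To see what the helper f returns, here is a small base R sketch of the same tapply-then-index step on made-up numbers. The vectors `count` and `x` below are invented stand-ins for the computed ..count.. and ..x.. values, not output from the plot.

```r
# Made-up per-row values: two fill groups in each of two bins
count <- c(5, 10, 3, 20)   # count per (bin, group) row
x     <- c(-5, 5, -5, 5)   # bin midpoint of each row

bin <- factor(x)

# tapply() sums the counts within each bin (one value per bin);
# indexing by the factor expands that back to one value per row
total <- tapply(count, bin, sum)[bin]

as.vector(total)        # total count of the bin each row belongs to
as.vector(total) > 20   # the logical that would drive alpha
```

Indexing with a factor uses its integer codes, which is why every row picks up the total of its own bin.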
EDIT: To allow for faceting we have to pass the ..PANEL.. identifier to the function as an additional argument. Instead of using tapply I now use dplyr::add_count, counting by bin and facet panel, to compute the total count per bin and panel:
library(ggplot2)
library(dplyr)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100)),
type = rep(c("A", "B"), each = 100)
)
threshold <- 20
f <- function(count, x, PANEL) {
data.frame(count, x, PANEL) %>%
add_count(x, PANEL, wt = count) %>%
pull(n)
}
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x.., ..PANEL..) > threshold),
position = "fill", bins = 10
) +
facet_wrap(~type)
p
#> Warning: Using alpha for a discrete variable is not advised.
#> Warning: Removed 2 rows containing missing values (geom_bar).
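The per-bin, per-panel total can also be checked outside of ggplot2. The following is a base R sketch that swaps in `stats::ave()` for `add_count()` (this substitution is mine, not part of the answer above), again on invented values for count, x and PANEL.

```r
# Made-up values: panel 1 has two bins with two groups each,
# panel 2 has a single bin with two groups
count <- c(5, 10, 3, 20, 1, 2)
x     <- c(-5, 5, -5, 5, -5, -5)
PANEL <- c(1, 1, 1, 1, 2, 2)

# ave() applies sum() within each (x, PANEL) combination and
# returns one value per row, in the original row order
total <- ave(count, x, PANEL, FUN = sum)
total
```

Because the panel is part of the grouping, the bin at -5 gets a different total in each panel, which is what makes the highlighted area independent across facets.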

ggplot2: split scatter plot by categorical variable

I am trying to generate a scatter plot where the x-axis is several categories of a continuous variable. The closest thing to it would be a Manhattan plot, where the x-axis is split by chromosome (categorical), but within each category the values are continuous.
Data:
chr <- sample(x = c(1,2), replace = T, size = 1000)
bp <- as.integer(runif(n = 1000, min = 0, max = 10000))
p <- runif(n = 1000, min = 0, max = 1)
df <- data.frame(chr,bp,p)
Starting Point:
ggplot(df, aes(y = -log10(p), x =bp)) + geom_point(colour=chr)
The red and black points should be separate categories along the x-axis.
I am not sure if I have understood your question. Probably you are looking for facets. See the example.
require(ggplot2)
chr <- sample(x = c(1,2), replace = T, size = 1000)
bp <- as.integer(runif(n = 1000, min = 0, max = 10000))
p <- runif(n = 1000, min = 0, max = 1)
df <- data.frame(chr,bp,p)
ggplot(df, aes(y = -log10(p), x = bp)) +
geom_point(aes(colour = factor(chr))) +
facet_wrap("chr")
If you really want to do this in a single plot instead of facets, you could conditionally rescale your x variable and then manually adjust the labels, e.g.:
library(dplyr) # provides %>% and mutate()
df %>%
mutate(bp.scaled = ifelse(chr == 2, bp + 10000, bp)) %>%
ggplot(aes(y = -log10(p), x = bp.scaled)) + geom_point(colour=chr) +
scale_x_continuous(breaks = seq(0,20000,2500),
labels = c(seq(0,10000,2500), seq(2500,10000,2500)))
Result:
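The conditional rescaling above hard-codes an offset of 10000 for the second category. As a hedged variation, the offsets could be derived from the data so that the same idea extends to more than two categories. The names `chr_max`, `offsets` and `bp_scaled` are made up for this sketch, and using each category's maximum as the block width is an assumption.

```r
set.seed(1)
chr <- rep(1:3, each = 10)
bp  <- as.integer(runif(n = 30, min = 0, max = 10000))

# Largest position observed within each category
chr_max <- tapply(bp, chr, max)

# Each category starts where the previous ones end:
# offset 0 for the first, cumulative maxima for the rest
offsets <- c(0, cumsum(chr_max)[-length(chr_max)])
names(offsets) <- names(chr_max)

# Shift every position by the offset of its category
bp_scaled <- bp + unname(offsets[as.character(chr)])

range(bp_scaled[chr == 1])
range(bp_scaled[chr == 2])
range(bp_scaled[chr == 3])
```

The shifted values can then be used as the x aesthetic, with breaks and labels set manually as in the answer above.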

Mix color and fill aesthetics in ggplot

I wonder if there is the possibility to change the fill main colour according to a categorical variable
Here is a reproducible example
df = data.frame(x = c(rnorm(10, mean = 0),
rnorm(10, mean = 3)),
y = c(rnorm(10, mean = 0),
rnorm(10, mean = 3)),
grp = c(rep('a', times = 10),
rep('b', times = 10)),
val = rep(1:10, times = 2))
ggplot(data = df,
aes(x = x,
y = y)) +
geom_point(pch = 21,
aes(color = grp,
fill = val,
size = val))
Of course it is easy to change the circle colour/shape according to the variable grp, but I'd like to have group a in shades of red and group b in shades of blue.
I also thought about using facets, but don't know if the fill gradient can be changed for the two panels.
Does anyone know if that can be done without gridExtra?
Thanks!
I think there are two ways to do this. The first is using the alpha aesthetic for your val column. This is a quick and easy way to accomplish your goal but may not be exactly what you want:
ggplot(data = df,
aes(x = x,
y = y)) +
geom_point(pch = 21,
aes(alpha=val,
fill = grp,
size = val)) + theme_minimal()
The second way would be to do something similar to this post: Vary the color gradient on a scatter plot created with ggplot2. I edited the code slightly so it's not a range from white to your color of interest but from a lighter color to a darker color. This requires a little bit of work and uses the scale_fill_identity function, which takes a variable that already holds the colors you want and maps them directly to each point (so it doesn't do any scaling).
This code is:
library(scales) # rescale()
library(plyr)   # ddply()
# grp must be a factor so that as.numeric() returns the level index (1 for a, 2 for b)
df$grp <- factor(df$grp)
#Rescale val to [0,1]
df$scaled_val <- rescale(df$val)
low_cols <- c("firebrick1","deepskyblue")
high_cols <- c("darkred","deepskyblue4")
df$col <- ddply(df, .(grp), function(x)
data.frame(col=apply(colorRamp(c(low_cols[as.numeric(x$grp)[1]], high_cols[as.numeric(x$grp)[1]]))(x$scaled_val),
1,function(x)rgb(x[1],x[2],x[3], max=255)))
)$col
df
ggplot(data = df,
aes(x = x,
y = y)) +
geom_point(pch = 21,
aes(
fill = col,
size = val)) + theme_minimal() +scale_fill_identity()
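To make the colour-mapping step easier to follow, here is a minimal base R sketch of what colorRamp does for a single group. The values in `vals` are made up, and only the red pair of colours from the answer is used.

```r
# colorRamp() returns a function mapping numbers in [0, 1]
# to RGB rows interpolated between the given colours
ramp <- colorRamp(c("firebrick1", "darkred"))

vals <- c(0, 0.5, 1)   # made-up, already rescaled values
rgb_mat <- ramp(vals)  # 3 x 3 matrix, one row per value

# rgb() accepts a three-column matrix and returns hex colour strings,
# which is the kind of value scale_fill_identity() uses as-is
cols <- rgb(rgb_mat, maxColorValue = 255)
cols
```

The endpoints reproduce the two input colours, and values in between give intermediate shades.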
Thanks to this other post I found a way to visualize the fill bar in the legend, even though that wasn't what I meant to do.
Here's the output
And the code
library(dplyr)   # %>%, group_by(), mutate()
library(scales)  # rescale()
library(ggplot2)
df = data.frame(x = c(rnorm(10, mean = 0),
rnorm(10, mean = 3)),
y = c(rnorm(10, mean = 0),
rnorm(10, mean = 3)),
grp = factor(c(rep('a', times = 10),
rep('b', times = 10)),
levels = c('a', 'b')),
val = rep(1:10, times = 2)) %>%
group_by(grp) %>%
mutate(scaledVal = rescale(val)) %>%
ungroup %>%
mutate(scaledValOffSet = scaledVal + 100*(as.integer(grp) - 1))
scalerange <- range(df$scaledVal)
# one (low, high) pair of stops per group: two groups give four stops, matching the four colours below
gradientends <- scalerange + rep(c(0,100), each=2)
ggplot(data = df,
aes(x = x,
y = y)) +
geom_point(pch = 21,
aes(fill = scaledValOffSet,
size = val)) +
scale_fill_gradientn(colours = c('white',
'darkred',
'white',
'deepskyblue4'),
values = rescale(gradientends))
Basically one should rescale fill values (e.g. between 0 and 1) and separate them using another order of magnitude, provided by the categorical variable grp.
This is not what I wanted though: the snippet can be improved, of course, to make the whole thing less manual, but still lacks the simple usual discrete fill legend.
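The rescale-and-offset idea can be checked numerically without plotting. This is a base R sketch under stated assumptions: `rescale01` is a made-up stand-in for scales::rescale with its default output range, and the data mirror the val and grp columns of the example.

```r
val <- as.numeric(rep(1:10, times = 2))
grp <- factor(rep(c("a", "b"), each = 10), levels = c("a", "b"))

# Hypothetical helper: linear rescaling to [0, 1]
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Rescale within each group, then push group b far away from group a
scaled     <- ave(val, grp, FUN = rescale01)
offset_val <- scaled + 100 * (as.integer(grp) - 1)

# One (low, high) pair of gradient stops per group, rescaled to [0, 1]
gradient_ends <- c(0, 1, 100, 101)
gradient_pos  <- rescale01(gradient_ends)

range(offset_val[grp == "a"])
range(offset_val[grp == "b"])
gradient_pos
```

Group a occupies [0, 1] and group b occupies [100, 101], so each group falls between its own pair of gradient stops and almost no data lands in the transition between the two colour ramps.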

Boxplot width in ggplot with cross classified groups

I am making boxplots with ggplot with data that is classified by 2 factor variables. I'd like to have the box sizes reflect sample size via varwidth = TRUE but when I do this the boxes overlap.
1) Some sample data with a 3 x 2 structure
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE),group2= sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
2) Default boxplots: ggplot without variable width
ggplot(data = data, aes(y = response, x = group1, color = group2)) + geom_boxplot()
I like how the first level of grouping is shown.
Now I try to add variable widths...
3) ...and What I get when varwidth = TRUE
ggplot(data = data, aes(y = response, x = group1, color = group2)) + geom_boxplot(varwidth = T)
This overlap seems to occur whether I use color = group2 or group = group2 in both the main call to ggplot and in the geom_boxplot statement. Fussing with position_dodge doesn't seem to help either.
4) A solution I don't like visually is to make unique factors by combining my group1 and group2
data$grp.comb <- paste(data$group1, data$group2)
ggplot(data = data, aes(y = response, x = grp.comb, color = group2)) + geom_boxplot()
I prefer having things grouped to reflect the cross classification
5) The way forward:
I'd like to either a) figure out how to make varwidth = TRUE not cause the boxes to overlap, or b) manually adjust the space between the combined groups so that boxes within the first level of grouping are closer together.
I think your problem can be solved best by using facet_wrap.
library(ggplot2)
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE), group2=
sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group2, color = group2)) +
geom_boxplot(varwidth = TRUE) +
facet_wrap(~group1)
Which gives:
A recent update to ggplot2 makes it so that the code provided by @N Brouwer in (3) works as expected:
# library(devtools)
# install_github("tidyverse/ggplot2")
packageVersion("ggplot2") # works with v2.2.1.9000
library(ggplot2)
set.seed(1234)
data <- data.frame(group1= sample(c("A","B","C"), 100, replace = TRUE),
group2= sample(c("D","E"), 100, replace = TRUE),
response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group1, color = group2)) +
geom_boxplot(varwidth = T)
(I'm a new user and can't post images inline)
fig 1
This question has been answered here: ggplot increase distance between boxplots
The answer involves using the position = position_dodge() argument of geom_boxplot().
For your example:
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE), group2=
sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group1, color = group2)) +
geom_boxplot(position = position_dodge(1))
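Since varwidth = TRUE draws boxes with widths proportional to the square roots of the group sizes, it can help to look at the cell counts of the cross-classification before judging the plot. The following base R sketch does that for the example data; `n_cell` and `rel_width` are made-up names, and expressing the widths relative to the largest cell is only for illustration, not a description of how ggplot2 normalizes them internally.

```r
set.seed(1234)
data <- data.frame(group1 = sample(c("A", "B", "C"), 100, replace = TRUE),
                   group2 = sample(c("D", "E"), 100, replace = TRUE),
                   response = rnorm(100, mean = 0, sd = 1))

# Number of observations in each group1 x group2 cell
n_cell <- table(data$group1, data$group2)

# Square-root scaling, expressed relative to the largest cell
rel_width <- sqrt(n_cell) / sqrt(max(n_cell))

n_cell
round(rel_width, 2)
```

Cells with similar counts get nearly identical widths, so large visual differences between boxes only appear when the sample sizes differ substantially.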

Resources