Change the colour palette in histogram - r

I am trying to change the colours of my histogram but am not sure how to do it. This is my code:
qplot(user, count, data=count_group, geom="histogram", fill=group,
xlab = "users", ylab="count",
main="Users")+
opts(axis.text.x=theme_text(angle=90, hjust=0, size=7))
Here is the histogram I get, but the default colours are too bright.
I would like to use colours like this.
I tried adding the following line, but it didn't work:
scale_fill_brewer(palette = palette)

If you want to use the Brewer Set1 with that many groups, you could do something like this:
library(ggplot2)
count_group <- data.frame(user=factor(rep(1:50, 2)),
count=sample(100, 100, replace=T),
group=factor(rep(LETTERS[1:20], 5)))
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(9, "Set1"))
ngroups <- length(unique(count_group$group))
qplot(user, count, data=count_group, geom="histogram", fill=group,
xlab = "users", ylab="count") +
opts(axis.text.x=theme_text(angle=90, hjust=0, size=7)) +
scale_fill_manual(values = cols(ngroups))
EDIT
You can create and use multiple colorRampPalettes, e.g. to assign blues to groups A to J and reds to groups K to T:
blues <- colorRampPalette(c('dark blue', 'light blue'))
reds <- colorRampPalette(c('pink', 'dark red'))
qplot(user, count, data=count_group, geom="histogram", fill=group,
xlab = "users", ylab="count") +
opts(axis.text.x=theme_text(angle=90, hjust=0, size=7)) +
scale_fill_manual(values = c(blues(10), reds(10)))
# blues(10) and reds(10) because you want blues for the first ten
# groups and reds thereafter. Each call returns a vector of ten hex
# colours forming a gradient of blues or a gradient of reds.
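As a small illustration of the point made in the comments above, the sketch below uses only base R to show that each palette function returns a plain character vector of hex colours. Naming the combined vector by group level is an optional extra that is not part of the original answer; scale_fill_manual() matches named values to factor levels, which guards against the level order changing. The name fill_values is chosen here purely for illustration.

```r
# colorRampPalette() returns a *function*; calling it with n gives n colours.
blues <- colorRampPalette(c("darkblue", "lightblue"))
reds  <- colorRampPalette(c("pink", "darkred"))

# Twenty group labels, matching the A..T groups used in the answer.
group_levels <- LETTERS[1:20]

# Ten blues followed by ten reds, named by the group they belong to.
fill_values <- c(blues(10), reds(10))
names(fill_values) <- group_levels

print(head(fill_values, 3))
```

The named vector could then be passed as scale_fill_manual(values = fill_values) in place of c(blues(10), reds(10)).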

An update on jbaums's answer. As far as I can make out, with the new ggplot2 (as of March 2014), the following syntax is available:
p <- qplot(user, count,
data = count_group,
geom = "histogram",
stat = "bin2d",
fill = group,
xlab = "users",
ylab = "count"
)
p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 7))
p <- p + scale_fill_manual(values = cols(ngroups))
p
This was a little long for a comment, but it is not a full answer: the rest of the code is as given by jbaums, who must be thanked!
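For readers on a more recent ggplot2, where qplot() has been deprecated, the same plot can probably be written with ggplot() directly. The following is a sketch under assumptions, not the original answerers' code: it uses geom_col() because the counts are already computed, and it swaps the Brewer-based palette for a base-R ramp so the example runs without RColorBrewer.

```r
library(ggplot2)

# Same shape of dummy data as in the answer above.
set.seed(1)
count_group <- data.frame(
  user  = factor(rep(1:50, 2)),
  count = sample(100, 100, replace = TRUE),
  group = factor(rep(LETTERS[1:20], 5))
)

# Stand-in palette (assumption): any colorRampPalette() function works here.
cols <- colorRampPalette(c("darkblue", "lightblue"))
ngroups <- nlevels(count_group$group)

# geom_col() draws bars from precomputed y values; bars stack by default.
p <- ggplot(count_group, aes(x = user, y = count, fill = group)) +
  geom_col() +
  labs(x = "users", y = "count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 7)) +
  scale_fill_manual(values = cols(ngroups))
```

Printing p draws the chart; the theme() and scale_fill_manual() lines are the same as in the update above.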


Plotting a vertical normal distribution next to a box plot in R

I'm trying to plot box plots with normal distribution of the underlying data next to the plots in a vertical format like this:
This is what I currently have graphed from an excel sheet uploaded to R:
And the code associated with them:
set.seed(12345)
library(ggplot2)
library(ggthemes)
library(ggbeeswarm)
#graphing boxplot and quasirandom scatterplot together
ggplot(X8_17_20_R_20_60, aes(Type, Diameter)) +
geom_quasirandom(shape=20, fill="gray", color = "gray") +
geom_boxplot(fill="NA", color = c("red4", "orchid4", "dark green", "blue"),
outlier.color = "NA") +
theme_hc()
Is this possible in ggplot2 or R in general? Or is the only feasible way to use something like OriginLab (where the first picture came from)?
You can do something similar to your example plot with the gghalves package:
library(ggplot2)
library(ggthemes) # for theme_hc()
library(gghalves)
n=0.02
ggplot(iris, aes(Species, Sepal.Length)) +
geom_half_boxplot(center=TRUE, errorbar.draw=FALSE,
width=0.5, nudge=n) +
geom_half_violin(side="r", nudge=n) +
geom_half_dotplot(dotsize=0.5, alpha=0.3, fill="red",
position=position_nudge(x=n, y=0)) +
theme_hc()
There are a few ways to do this. To gain full control over the look of the plot, I would just calculate the curves and plot them. Here's some sample data that's close to your own and shares the same names, so it should be directly applicable:
set.seed(12345)
X8_17_20_R_20_60 <- data.frame(
Diameter = rnorm(4000, rep(c(41, 40, 42, 40), each = 1000), sd = 6),
Type = rep(c("AvgFeret", "CalcDiameter", "Feret", "MinFeret"), each = 1000))
Now we create a little data frame of normal distributions based on the parameters taken from each group:
df <- do.call(rbind, mapply( function(d, n) {
y <- seq(min(d), max(d), length.out = 1000)
data.frame(x = n - 5 * dnorm(y, mean(d), sd(d)) - 0.15, y = y, z = n)
}, with(X8_17_20_R_20_60, split(Diameter, Type)), 1:4, SIMPLIFY = FALSE))
Finally, we draw your plot and add a geom_path with the new data.
library(ggplot2)
library(ggthemes)
library(ggbeeswarm)
ggplot(X8_17_20_R_20_60, aes(Type, Diameter)) +
geom_quasirandom(shape = 20, fill = "gray", color = "gray") +
geom_boxplot(fill="NA", aes(color = Type), outlier.color = "NA") +
scale_color_manual(values = c("red4", "orchid4", "dark green", "blue")) +
geom_path(data = df, aes(x = x, y = y, group = z), size = 1) +
theme_hc()
Created on 2020-08-21 by the reprex package (v0.3.0)
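The curve construction in the answer above can also be wrapped in a small helper, which makes the role of the two constants (5 and 0.15) easier to see. This is only a sketch: normal_curve, scale, gap and points are names introduced here for illustration, and the toy data is a cut-down version of the sample data above.

```r
# Hypothetical helper: path of a sideways normal curve for one group.
#   d      numeric values of the group
#   n      integer position of the group on the discrete x axis
#   scale  horizontal stretch applied to the density (5 in the answer)
#   gap    offset of the curve's baseline to the left of position n (0.15)
#   points number of points along the curve
normal_curve <- function(d, n, scale = 5, gap = 0.15, points = 1000) {
  y <- seq(min(d), max(d), length.out = points)
  data.frame(x = n - scale * dnorm(y, mean(d), sd(d)) - gap, y = y, z = n)
}

# Toy data with two groups only, to keep the example small.
set.seed(12345)
toy <- data.frame(
  Diameter = rnorm(400, rep(c(41, 40), each = 200), sd = 6),
  Type = rep(c("AvgFeret", "CalcDiameter"), each = 200)
)

# One curve per group; groups are numbered in the order split() returns them.
groups <- split(toy$Diameter, toy$Type)
curves <- do.call(rbind, Map(normal_curve, groups, seq_along(groups)))
```

The resulting curves data frame has the same x, y and z columns as df above, so it can be passed to geom_path() in the same way. Increasing scale makes the bell wider; increasing gap moves it further from the box.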

ggplot2 colorbar with discontinuous jump for skewed data

Here is some fake data, x and y, with color information z. z is highly skewed, and as such renders the colorbar uninformative:
set.seed(1)
N <- 100
x <- rnorm(N)
y <- x + rnorm(N)
z <- x+y+rnorm(N)
z[z>2] <- z[z>2]+exp(z[z>2]-2)
d <- data.frame(x,y,z)
ggplot(d, aes(x=x, y=y, color = z)) + geom_point()
I'd like to have most of the colorbar reflect the main range of the data, but have a box for overflows, say above 5. Something like this:
Is there a way to do this in ggplot2? Note that I would like the colorbar to remain continuous, rather than discrete, for most of its range. I'll probably either discretize or topcode if what I want isn't feasible.
You can get that general plot, although the legends would need more work:
p <- ggplot(d, aes(x=x, y=y, color = z)) + geom_point(size = 5)
p + scale_color_gradient2(
low = 'green', high = 'red', mid = 'grey80', na.value = 'blue', limits= c(-10, 10)
)
You can cheat in some extra legend fluff, e.g.:
ggplot(d, aes(x=x, y=y, color = z, alpha = '>10')) +
geom_point(size = 5) +
scale_color_gradient2(
low = 'green', high = 'red', mid = 'grey80', na.value = 'blue', limits= c(-10, 10),
guide = guide_colorbar(title.position = 'left')
) +
scale_alpha_manual(
values = 1, name = 'z',
guide = guide_legend(
override.aes = list(color = 'blue'), title.position = 'left',
title.theme = element_text(color = 'white', angle = 0)
)
) +
theme(legend.margin = margin(-5, 10, -5, 10))
Note that red/green palettes are bad for people with color vision deficiency.
Extending upon Axeman's answer I came up with the following slight hack to get blues into your color scale:
First, define a color map with 20 colors for the values within and 5 for the values outside your range.
cmap <- colorRampPalette(c("green","grey80","red"))(20)
cmap <- append(cmap,rep("blue",5))
Then cut the z values into 20 bins between -10 and 10 and convert the result to numeric (giving NA for values above 10). By passing cmap to scale_color_gradientn with limits of [1,25], we map values near -10 to 1 (green) and values near 10 to 20 (red). Finally, by specifying breaks we manually add the correct labels (e.g. the 5th bin corresponds to values between -6 and -5).
ggplot(d, aes(x=x, y=y, color=as.numeric(cut(z, breaks=seq(-10,10))))) +
geom_point(size=3) +
scale_color_gradientn(colors=cmap, limits=c(1,25), breaks=c(5,11,17,23),
labels=c(-6,0,6,">10"), name="z", na.value = "blue")
Lovely result :)
The only issue is that you have to make sure no values ever fall below -10, as they would also become NA and be shown in blue with this method.
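The question also mentions top-coding as a fallback. A minimal sketch of that route, assuming an overflow threshold of 5 (the variable cap and the column z_top are names introduced here), keeps the colorbar continuous and simply relabels its top break:

```r
library(ggplot2)

# Same fake data as in the question.
set.seed(1)
N <- 100
x <- rnorm(N)
y <- x + rnorm(N)
z <- x + y + rnorm(N)
z[z > 2] <- z[z > 2] + exp(z[z > 2] - 2)
d <- data.frame(x, y, z)

cap <- 5                    # assumed overflow threshold
d$z_top <- pmin(d$z, cap)   # top-code: everything above cap becomes cap

# Continuous colorbar up to cap; the top break is labelled as an overflow.
p <- ggplot(d, aes(x = x, y = y, color = z_top)) +
  geom_point(size = 3) +
  scale_color_gradient2(
    low = "blue", mid = "grey80", high = "red",
    limits = c(-10, cap),
    breaks = c(-5, 0, cap),
    labels = c("-5", "0", paste0(">= ", cap)),
    name = "z"
  )
```

The trade-off is that all overflow points share the colour of the top of the scale rather than getting a separate box, so this is simpler than the answers above but less distinctive.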

Overlaying points and controlling size with ggplot2

I want to plot some point estimates with a couple of interval estimates around them, and then to superimpose the true point values using a different color and size, with a legend for the color.
I've tried lots of things. If I just use a new call to geom_point, I can't figure out how to add a legend. Therefore, my current approach resorts to stacking the data on top of itself, which is clumsy. Even then, the graph comes out wrong with big blue points for the True values, with the desired orange points on top of them.
I'd appreciate any help I can get.
nms <- c("2.5%","25%","50%","75%","97.5%","dose","truep")
a <- c(9.00614679684893e-44,0.000123271800672435,0.0339603711049475,0.187721170170911,0.67452033450121,5,0.040752445325937)
b <- c(1.59502878028266e-25,0.00328588588499889,0.0738203422543555,0.25210200886225,0.714843425007051,10,0.0885844107052267)
cc <- c(1.41975723605948e-14,0.0184599181547097,0.118284929584256,0.311068595276067,0.74339745948793,15,0.141941915501108)
d <- c(0.0311851190805834,0.154722028150561,0.299318020818234,0.50887634580605,0.838779816278485,25,0.359181624981881)
e <- c(0.0529617924263383,0.289588386297245,0.566777817134668,0.883959271416755,0.999999999999317,40,0.680133380561602)
f <- c(0.0598904847882839,0.327655201251564,0.640100529843672,0.950060245074853,1,50,0.768120635812406)
g <- c(0.0641613025760661,0.355626055560067,0.686504841650593,0.978023943968809,1,60,0.823805809980712)
p <- as.data.frame(t(data.frame(a, b, cc, d, e, f, g)))
names(p) <- nms
# Faff duplicating data
p$truep <- 1.2 * p$truep
p2 <- p
p2[, 1:5] <- p$truep # truep is known, so there are no intervals
p3 <- rbind(p2, p)
p3$wh <- rep((c(2, 3)), each=nrow(p))
p3$col <- rep(c("orange", "blue"), each=nrow(p))
ggplot(p3, aes(dose, `50%`)) +
geom_point(aes(size=wh, color=col)) +
scale_size(range=c(5, 7), guide="none") +
scale_color_manual(name="", labels=c("Prior", "True"), values=c("blue", "orange")) +
geom_pointrange(aes(ymin=`2.5%`, ymax=`97.5%`, x=dose), color="blue") +
geom_pointrange(aes(ymin=`25%`, ymax=`75%`, x=dose), color="blue", size=2) +
geom_point(aes(dose, truep), color="orange") +
theme(axis.text.x=element_text(size=12), axis.title.x=element_text(size=14),
axis.text.y=element_text(size=12), axis.title.y=element_text(size=14),
legend.text=element_text(size=12))
R 3.3.1, ggplot2_2.1.1
Thanks,
Harry
I found a solution by splitting the dataset in two parts:
library(dplyr)
priors <- p%>%
mutate(datatype = 'Prior')
truevals <- p%>%
select(dose, truep)%>%
mutate(datatype = 'True')
ggplot(truevals, aes(x = dose, y = truep, colour = datatype))+
geom_pointrange(data = priors, aes(ymin=`25%`, ymax=`75%`, y = `50%`), size=1.5) +
geom_pointrange(data = priors, aes(ymin=`2.5%`, ymax=`97.5%`, y = `50%`))+
geom_point()+
scale_color_manual(name="", values=c("Prior" = "blue", "True" = "orange")) +
theme(axis.text.x=element_text(size=12), axis.title.x=element_text(size=14),
axis.text.y=element_text(size=12), axis.title.y=element_text(size=14),
legend.text=element_text(size=12))
First we plot the two pointranges based on the dataset with priors, then the actual values. By adding a column with the datatype to both datasets, we get the legend. The result is this graph:
ggplot2::geom_point() has a show.legend argument, which is NA by default, so setting it to TRUE should help.
You can set the legend labels using the labels argument as follows:
ggplot2::scale_fill_manual(values = c("red", "black"),
labels = c("Number of people",
"Number of birds"))
You are already doing this with labels=c("Prior", "True").
You can also change the look of the legend with:
ggplot2::theme(legend.position = "bottom",
legend.text = ggplot2::element_text(size = 22),
legend.box = "horizontal",
legend.key = ggplot2::element_blank())
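Another way to get a colour legend without stacking or splitting the data is to map a constant string to colour inside aes() in each layer. The sketch below illustrates that idea on made-up numbers; the data frame est and its columns lo, mid and hi are hypothetical stand-ins for the quantile columns in the question.

```r
library(ggplot2)

# Toy values invented for this example only.
est <- data.frame(
  dose  = c(5, 10, 15),
  lo    = c(0.00, 0.01, 0.02),
  mid   = c(0.03, 0.07, 0.12),
  hi    = c(0.67, 0.71, 0.74),
  truep = c(0.05, 0.11, 0.17)
)

# A constant inside aes() is treated as a one-level variable, so each layer
# contributes an entry to the shared colour legend.
p <- ggplot(est, aes(x = dose)) +
  geom_pointrange(aes(y = mid, ymin = lo, ymax = hi, color = "Prior")) +
  geom_point(aes(y = truep, color = "True"), size = 3) +
  scale_color_manual(name = "", values = c(Prior = "blue", True = "orange"))
```

Because the true values are drawn in their own layer, their size can be set directly in geom_point() without affecting the interval layers.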

ggplot pie charts / bar graph, force a legend [duplicate]

I'm struggling with ggplot (I always do). There are a number of very similar questions about forcing ggplot to include zero value categories in legends - here and here (for example). BUT I (think I) have a slightly different requirement to which all my mucking about with scale_x_discrete and scale_fill_manual has not helped.
Requirement: As you can see, the right-hand plot has no data in the TM=5 category, so that category is missing. What I need is for the right-hand plot to show category 5 on the axis, but obviously with no points or box.
Current Plot Script:
#data
plotData <- data.frame("TM" = c(3,2,3,3,3,4,3,2,3,3,4,3,4,3,2,3,2,2,3,2,3,3,3,2,3,1,3,2,2,4,4,3,2,3,4,2,3),
"Score" = c(5,4,4,4,3,5,5,5,5,5,5,3,5,5,4,4,5,4,5,4,5,4,5,4,4,4,4,4,5,4,4,5,3,5,5,5,5))
#vars
xTitle <- bquote("T"["M"])
v.I <- plotData$TM
depVar <- plotData$Score
#plot
p <- ggplot(plotData, aes_string(x=v.I,y=depVar,color=v.I)) +
geom_point() +
geom_jitter(alpha=0.8, position = position_jitter(width = 0.2, height = 0.2)) +
geom_boxplot(width=0.75,alpha=0.5,aes_string(group=v.I)) +
theme_bw() +
labs(x=xTitle) +
labs(y=NULL) +
theme(legend.position='none',
axis.text=element_text(size=10, face="bold"),
axis.title=element_text(size=16))
Attempted Solutions:
Adding drop=FALSE to the scales (suggested by @Jarretinha here) totally borks the margins and x-axis labels
> plot + scale_x_discrete(drop=FALSE) + scale_fill_manual(drop=FALSE)
Following logic from here and manually setting the labels in scale_fill_manual does nothing and results in the same right-hand plot from example above.
> p + scale_fill_manual(values = c("red", "blue", "green", "purple", "pink"),
labels = c("Cat1", "Cat2", "Cat3", "Cat4", "Cat5"),
drop=FALSE)
Playing with this logic and trying something with scale_x_discrete changes the category names on the x-axis, but the fifth is still missing AND the margins (as in attempt 1) are borked again. It seems apparent that scale_x_discrete is important BUT NOT the whole answer.
> p + scale_x_discrete(limits = c("Cat1", "Cat2", "Cat3", "Cat4", "Cat5"), drop=FALSE)
ANSWER for the above example, courtesy of input from @Bouncyball & @aosmith
#data
plotData <- data.frame("TM" = c(3,2,3,3,3,4,3,2,3,3,4,3,4,3,2,3,2,2,3,2,3,3,3,2,3,1,3,2,2,4,4,3,2,3,4,2,3),
"Score" = c(5,4,4,4,3,5,5,5,5,5,5,3,5,5,4,4,5,4,5,4,5,4,5,4,4,4,4,4,5,4,4,5,3,5,5,5,5))
plotData$TM <- factor(plotData$TM, levels=1:5) # declare all desired levels (1 to 5), including the empty one
#vars
xTitle <- bquote("T"["M"])
v.I <- plotData$TM
depVar <- plotData$Score
myPalette <- c('#5c9bd4','#a5a5a4','#4770b6','#275f92','#646464','#002060')
#plot
ggplot(plotData, aes_string(x=v.I,y=depVar,color=v.I)) +
geom_jitter(alpha=0.8, position = position_jitter(width = 0.2, height = 0.2)) +
geom_boxplot(width=0.75,alpha=0.5,aes_string(group=v.I)) +
scale_colour_manual(values = myPalette, drop=F) + # new line added here
scale_x_discrete(drop=F) + # new line added here
theme_bw() +
labs(x=xTitle) +
labs(y=NULL) +
theme(legend.position='none',
axis.text=element_text(size=10, face="bold"),
axis.title=element_text(size=16))
Here's a workaround you could use:
# generate dummy data
set.seed(123)
df1 <- data.frame(lets = sample(letters[1:4], 20, replace = T),
y = rnorm(20), stringsAsFactors = FALSE)
# define factor, including the missing category as a level
df1$lets <- factor(df1$lets, levels = letters[1:5])
# make plot
ggplot(df1, aes(x = lets, y = y))+
geom_boxplot(aes(fill = lets))+
geom_point(data = NULL, aes(x = 'e', y = 0), pch = NA)+
scale_fill_brewer(drop = F, palette = 'Set1')+
theme_bw()
Basically, we plot an "empty" point (i.e. pch = NA) so that the category shows up on the x-axis, but has no visible geom associated with it. We also define our discrete variable, lets as a factor with five levels when only four are present in the data.frame. The missing category is the letter e.
NB: You'll have to adjust the positioning of this "empty" point so that it doesn't skew your y axis.
Otherwise, you could use the result from this answer to avoid having to plot an "empty" point.
# generate dummy data
set.seed(123)
df1 <- data.frame(lets = sample(letters[1:4], 20, replace = T),
y = rnorm(20), stringsAsFactors = FALSE)
# define factor, including the missing category as a level
df1$lets <- factor(df1$lets, levels = letters[1:5])
# make plot
ggplot(df1, aes(x = lets, y = y)) +
geom_boxplot(aes(fill = lets)) +
scale_x_discrete(drop = F) +
scale_fill_brewer(drop = F, palette = 'Set1') +
theme_bw()
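Both answers rely on the same base-R behaviour: a factor remembers every level it was declared with, whether or not that level occurs in the data, and drop = FALSE tells the scale to keep those levels. A minimal sketch with a few made-up values (tm and tm_f are names used here for illustration):

```r
# A handful of made-up TM values; category 5 never occurs.
tm <- c(3, 2, 3, 4, 1)

# Declaring levels 1..5 keeps the empty category attached to the factor.
tm_f <- factor(tm, levels = 1:5)

print(levels(tm_f))   # all five levels are present
print(table(tm_f))    # level 5 is listed with a count of zero

# droplevels() removes unused levels, which is effectively what a discrete
# scale shows by default when drop = TRUE.
print(nlevels(droplevels(tm_f)))
```

If the fifth category still does not appear on the axis, checking levels() on the plotted column is a quick way to confirm the factor was built as intended.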

Overlaying histograms with ggplot2 in R

I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data,
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
Note that the default position of geom_histogram is "stack".
See "position adjustment" in the geom_histogram documentation.
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. Borders and colouring need to be used properly so that the eye can differentiate between histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
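As a possible alternative to eval(parse(text = ...)), newer ggplot2 versions let you look up a column by name with the .data pronoun. The following sketch assumes a ggplot2 version that provides .data and after_stat(); the function name plot_multi_histogram_tidy and the bins argument are introduced here for illustration and are not part of the answer above.

```r
library(ggplot2)

# Variant of plot_multi_histogram that looks columns up by name via .data.
plot_multi_histogram_tidy <- function(df, feature, label_column, bins = 30) {
  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    # Histogram on the density scale so it is comparable with geom_density().
    geom_histogram(aes(y = after_stat(density)), bins = bins, alpha = 0.7,
                   position = "identity", color = "black") +
    geom_density(alpha = 0.7) +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_tidy(iris, "Sepal.Width", "Species")
```

This version omits the mean line to stay short; the axis and legend titles are set through labs() instead of guides().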
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing data frame in as before (and widening chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Result:
Since I set the means explicitly when building many_distros, I can simply pass them in. Alternatively, you can calculate them inside the function and use those values.
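Computing the per-group means from the data could look like the base-R sketch below. The helper name group_means is hypothetical, and the two-group data frame is a shortened stand-in for many_distros.

```r
# Shortened stand-in for many_distros: two groups with known means.
set.seed(42)
many_distros <- data.frame(
  n = c(rnorm(1000, mean = 1), rnorm(1000, mean = 2)),
  category = rep(c("A", "B"), each = 1000)
)

# Hypothetical helper: mean of `feature` within each level of `label_column`.
group_means <- function(df, feature, label_column) {
  tapply(df[[feature]], df[[label_column]], mean)
}

means <- group_means(many_distros, "n", "category")
print(round(means, 2))
```

The result is a named numeric vector, so it can be passed as the means argument of plot_multi_histogram above, or computed on the first line of that function instead of being supplied by the caller.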
