I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data:
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
Note that the default position of geom_histogram is "stack". See the "Position adjustment" section of the geom_histogram documentation.
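A minimal sketch of the difference between the two position settings, assuming ggplot2 is installed. The data are simulated in the same way as the sample data, and the black outline (colour = "black") is an addition for illustration, not part of the answer above. layer_data() is used only to inspect where the bars start.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for three groups (values are made up)
dat <- data.frame(
  xx = c(runif(100, 20, 50), runif(100, 40, 80), runif(100, 0, 30)),
  yy = rep(letters[1:3], each = 100)
)

# Default behaviour: bars of different groups are stacked on top of each other
p_stack <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, bins = 30)

# position = "identity": each group is drawn from the baseline, so groups overlap
p_identity <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, bins = 30, position = "identity", colour = "black")

# With "identity" every bar starts at zero; with "stack" some bars start higher
identity_starts_at_zero <- all(layer_data(p_identity)$ymin == 0)
stack_has_raised_bars <- any(layer_data(p_stack)$ymin > 0)
```

Printing p_identity should give overlapping, semi-transparent histograms with a visible outline, which is closer to the look of the density plot described in the question.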
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. Borders and colors need to be chosen carefully so the eye can differentiate between the histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing the data frame in as before (and widening the chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Result:
Since I set the means explicitly when building many_distros, I can simply pass them in. Alternatively, you can calculate them inside the function and use those values.
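A hedged sketch of that alternative, where the group means are computed inside the function. The name plot_multi_histogram_auto is hypothetical, and this version swaps eval(parse(...)) for the .data pronoun and ..density.. for after_stat(density), which assumes a reasonably recent ggplot2 (3.3.0 or later).

```r
library(ggplot2)

# Hypothetical variant: computes one mean per category instead of taking `means`
plot_multi_histogram_auto <- function(df, feature, label_column) {
  # One row per category with its mean (aggregate() names the value column "x")
  means <- aggregate(df[[feature]], by = list(group = df[[label_column]]), FUN = mean)
  names(means) <- c("group", "group_mean")

  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", color = "black") +
    geom_density(alpha = 0.7) +
    # One dashed line per category, taken from the summary data frame
    geom_vline(data = means, aes(xintercept = group_mean),
               color = "black", linetype = "dashed") +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_auto(iris, "Sepal.Width", "Species")
```

Usage is the same as before, minus the last argument, for example plot_multi_histogram_auto(many_distros, "n", "category").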
I would like to plot a graph from a Discriminant Function Analysis in which points must have a black border and be filled with specific colors and confidence ellipses must be the same color as the points are filled. Using the following code, I get almost the graph I want, except that points do not have a black border:
library(ggplot2)
library(ggord)
library(MASS)
data("iris")
set.seed(123)
linear <- lda(Species~., iris)
linear
dfaplot <- ggord(linear, iris$Species, labcol = "transparent", arrow = NULL, poly = FALSE, ylim = c(-11, 11), xlim = c(-11, 11))
dfaplot +
scale_shape_manual(values = c(16,15,17)) +
scale_color_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
theme(legend.position = "none")
PLOT 1
I could put a black border on the points by using the following code, but then confidence ellipses turn black.
dfaplot +
scale_shape_manual(values = c(21,22,24)) +
scale_color_manual(values = c("black","black","black")) +
scale_fill_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
theme(legend.position = "none")
PLOT 2
I would like to keep the ellipses as in the first graph, but the points as in the second one. However, I have been unable to figure out how to do this. If anyone has suggestions, I would be very grateful. I am using the "ggord" package because I learned how to run the analysis using it, but if anyone has suggestions on how to do the same with only ggplot, that would be fine.
This roughly replicates what is going on in ggord. Looking at the source for the package, the ellipses are implemented differently in ggord than below, hence the small differences. If that is a big deal, you can review the source and make changes. By default, geom_point uses a shape that has no fill attribute, so we set the shapes to types that do (21 to 25) and specify color = 'black' in geom_point(). The full code (including projecting the original data) is below.
set.seed(123)
linear <- lda(Species~., iris)
linear
# Get point x, y coordinates
df <- data.frame(predict(linear, iris[, 1:4]))
df$species <- iris$Species
# Get explained variance for each axis
var_exp <- 100 * linear$svd ^ 2 / sum(linear$svd ^ 2)
ggplot(data = df,
aes(x = x.LD1,
y = x.LD2)) +
geom_point(aes(fill = species,
shape = species),
color = "black",
size = 4) +
stat_ellipse(aes(color = species),
level = 0.95) +
ylim(c(-11, 11)) +
xlim(c(-11, 11)) +
ylab(paste("LD2 (",
round(var_exp[2], 2),
"%)")) +
xlab(paste("LD1 (",
round(var_exp[1], 2),
"%)")) +
scale_color_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
scale_fill_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
scale_shape_manual(values = c(21, 22, 24)) +
coord_fixed(1) +
theme_bw() +
theme(
legend.position = "none"
)
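The key idea in the code above can be isolated in a smaller sketch: points use fill for the interior and a constant black outline, while the ellipses keep their own colour scale. This uses the raw iris measurements rather than the discriminant scores, purely for illustration, and assumes ggplot2 and MASS are installed (stat_ellipse relies on MASS for its default ellipse type).

```r
library(ggplot2)

group_colours <- c("#00FF00", "#FF00FF", "#0000FF")

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  # Shapes 21 to 25 have both an outline (colour) and an interior (fill);
  # colour is set as a constant, so it is not touched by the colour scale
  geom_point(aes(fill = Species, shape = Species), colour = "black", size = 3) +
  # Only the ellipses map colour to Species
  stat_ellipse(aes(colour = Species), level = 0.95) +
  scale_shape_manual(values = c(21, 22, 24)) +
  scale_fill_manual(values = group_colours) +
  scale_colour_manual(values = group_colours) +
  theme_bw()

point_layer <- layer_data(p, 1)
```

Because the outline colour is a fixed parameter rather than a mapped aesthetic, scale_colour_manual only affects the ellipses.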
To plot arrows, you can grab the scaling from the lda output and plot it with geom_segment. I played with the colors/alpha so they were visible in the plot below.
scaling <- data.frame(linear$scaling)
...
geom_segment(data = scaling,
aes(x = 0,
y = 0,
xend = LD1,
yend = LD2),
arrow = arrow(),
color = "black") +
geom_text(data = scaling,
aes(x = ifelse(LD1 <= 0.1, LD1 - 2, LD1 + 2),
y = ifelse(LD2 <= 0.1, LD2 - 1, LD2 + 1)),
label = rownames(scaling),
color = "black") +
...
I am working with the R programming language. I made the following graph that shows a scatterplot between points of two different colors :
library(ggplot2)
a = rnorm(10000,10,10)
b = rnorm(10000, 10, 10)
c = as.factor("red")
data_1 = data.frame(a,b,c)
a = rnorm(10000,7,5)
b = rnorm(10000, 7, 5)
c = as.factor("blue")
data_2 = data.frame(a,b,c)
final = rbind(data_1, data_2)
my_plot = ggplot(final, aes(x=a, y=b, col = c)) + geom_point() + theme(legend.position="top") + ggtitle("My Plot")
My Question: Is there a way to "change the colors of overlapping points"?
Here is what I tried so far:
1) I found the following question (Visualizing two or more data points where they overlap (ggplot R)) and tried the strategy suggested:
linecolors <- c("#714C02", "#01587A", "#024E37")
fillcolors <- c("#9D6C06", "#077DAA", "#026D4E")
# partially transparent points by setting `alpha = 0.5`
ggplot(final, aes(a,b, colour = c, fill = c)) +
geom_point(alpha = 0.5) +
scale_color_manual(values=linecolors) +
scale_fill_manual(values=fillcolors) +
theme_bw()
This shows the two different colors along with the overlap, but it is quite dark and still not clear. Is there a way to pick better colors/resolutions for this?
2) I found the following link which shows how to make color gradients for continuous variables : https://drsimonj.svbtle.com/pretty-scatter-plots-with-ggplot2 - but I have discrete colors and I do not know how to apply this
3) I found this question over here (Any way to make plot points in scatterplot more transparent in R?) which shows how to do this with the base R plot, but not with ggplot2:
addTrans <- function(color,trans)
{
# This function adds transparency to a color.
# Define transparency with an integer between 0 and 255
# 0 being fully transparent and 255 being fully visible
# Works with either color and trans a vector of equal length,
# or one of the two of length 1.
if (length(color)!=length(trans)&!any(c(length(color),length(trans))==1)) stop("Vector lengths not correct")
if (length(color)==1 & length(trans)>1) color <- rep(color,length(trans))
if (length(trans)==1 & length(color)>1) trans <- rep(trans,length(color))
num2hex <- function(x)
{
hex <- unlist(strsplit("0123456789ABCDEF",split=""))
return(paste(hex[(x-x%%16)/16+1],hex[x%%16+1],sep=""))
}
rgb <- rbind(col2rgb(color),trans)
res <- paste("#",apply(apply(rgb,2,num2hex),2,paste,collapse=""),sep="")
return(res)
}
cols <- sample(c("red","green","pink"),100,TRUE)
# Very transparent:
plot(final$a , final$b ,col=addTrans(cols,100),pch=16,cex=1)
But this is also not able to differentiate between the two color classes that I have.
Problem: Can someone please suggest how to fix the problem with overlapping points, such that the overlap appears more visible?
Thanks!
I would use a density heatmap
ggplot(final, aes(x=a, y=b, col = c))+
stat_density_2d(aes(fill = stat(density)), geom = 'raster', contour = FALSE) +
scale_fill_viridis_c() +
coord_cartesian(expand = FALSE) +
geom_point(shape = '.', col = 'white')
or
ggplot(final, aes(x=a, y=b, col = c))+
stat_density_2d(aes(fill = stat(level)), geom = 'polygon') +
scale_fill_viridis_c(name = "density") +
geom_point(shape = '.')
or
ggplot(final, aes(x=a, y=b, col = c))+
geom_point(alpha = 0.1) +
geom_rug(alpha = 0.01)
I'm trying to plot box plots with normal distribution of the underlying data next to the plots in a vertical format like this:
This is what I currently have graphed from an excel sheet uploaded to R:
And the code associated with them:
set.seed(12345)
library(ggplot2)
library(ggthemes)
library(ggbeeswarm)
#graphing boxplot and quasirandom scatterplot together
ggplot(X8_17_20_R_20_60, aes(Type, Diameter)) +
geom_quasirandom(shape=20, fill="gray", color = "gray") +
geom_boxplot(fill="NA", color = c("red4", "orchid4", "dark green", "blue"),
outlier.color = "NA") +
theme_hc()
Is this possible in ggplot2 or R in general? Or is something like OriginLab (where the first picture came from) the only feasible way?
You can do something similar to your example plot with the gghalves package:
library(gghalves)
n=0.02
ggplot(iris, aes(Species, Sepal.Length)) +
geom_half_boxplot(center=TRUE, errorbar.draw=FALSE,
width=0.5, nudge=n) +
geom_half_violin(side="r", nudge=n) +
geom_half_dotplot(dotsize=0.5, alpha=0.3, fill="red",
position=position_nudge(x=n, y=0)) +
theme_hc()
There are a few ways to do this. To gain full control over the look of the plot, I would just calculate the curves and plot them. Here's some sample data that's close to your own and shares the same names, so it should be directly applicable:
set.seed(12345)
X8_17_20_R_20_60 <- data.frame(
Diameter = rnorm(4000, rep(c(41, 40, 42, 40), each = 1000), sd = 6),
Type = rep(c("AvgFeret", "CalcDiameter", "Feret", "MinFeret"), each = 1000))
Now we create a little data frame of normal distributions based on the parameters taken from each group:
df <- do.call(rbind, mapply( function(d, n) {
y <- seq(min(d), max(d), length.out = 1000)
data.frame(x = n - 5 * dnorm(y, mean(d), sd(d)) - 0.15, y = y, z = n)
}, with(X8_17_20_R_20_60, split(Diameter, Type)), 1:4, SIMPLIFY = FALSE))
Finally, we draw your plot and add a geom_path with the new data.
library(ggplot2)
library(ggthemes)
library(ggbeeswarm)
ggplot(X8_17_20_R_20_60, aes(Type, Diameter)) +
geom_quasirandom(shape = 20, fill = "gray", color = "gray") +
geom_boxplot(fill="NA", aes(color = Type), outlier.color = "NA") +
scale_color_manual(values = c("red4", "orchid4", "dark green", "blue")) +
geom_path(data = df, aes(x = x, y = y, group = z), size = 1) +
theme_hc()
Created on 2020-08-21 by the reprex package (v0.3.0)
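To see what the per-group curve calculation in the answer above does, here is a base R sketch for a single group. The values are simulated, and n = 1 stands for the position of the first category on the discrete x axis. The factor 5 and the offset 0.15 are the same constants as in the answer: the density is scaled, mirrored to the left of the box, and shifted away from it.

```r
set.seed(1)
d <- rnorm(1000, mean = 41, sd = 6)  # one group's values (simulated)
n <- 1                               # x position of that group on the axis

# Evaluate the fitted normal density along the range of the data
y <- seq(min(d), max(d), length.out = 1000)
dens <- dnorm(y, mean = mean(d), sd = sd(d))

# Mirror the curve to the left of position n and leave a 0.15 gap to the box
curve_df <- data.frame(x = n - 5 * dens - 0.15, y = y)

# The curve bulges furthest left where the density peaks, near the group mean
peak_y <- curve_df$y[which.min(curve_df$x)]
range(curve_df$x)
```

Changing 5 makes the curve wider or narrower, and changing 0.15 moves it closer to or further from the boxplot.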
I'm working with stock prices and trying to plot the price difference.
I created one using autoplot.zoo(). My question is: how can I change the point shapes to triangles when they are above the upper threshold and to circles when they are below the lower threshold? I understand that with the basic plot() function you can do this by calling points(); I am wondering how to do the same with ggplot2.
Here is the code for the plot:
p<-autoplot.zoo(data, geom = "line")+
geom_hline(yintercept = threshold, color="red")+
geom_hline(yintercept = -threshold, color="red")+
ggtitle("AAPL vs. SPY out of sample")
p+geom_point()
We can't fully replicate without your data, but here's an attempt with some sample generated data that should be similar enough that you can adapt for your purposes.
# Sample data
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
# Upper and lower threshold
thresh <- 4
You can create an additional variable that determines the shape, based on the relationship in the data itself, and pass that as an argument into ggplot.
# Create conditional data
data$outlier[data$spread > thresh] <- "Above"
data$outlier[data$spread < -thresh] <- "Below"
data$outlier[is.na(data$outlier)] <- "In Range"
library(ggplot2)
ggplot(data, aes(x = date, y = spread, shape = outlier, group = 1)) +
geom_line() +
geom_point() +
geom_hline(yintercept = c(thresh, -thresh), color = "red") +
scale_shape_manual(values = c(17,16,15))
# If you want points just above and below
# Sample data
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
thresh <- 4
data$outlier[data$spread > thresh] <- "Above"
data$outlier[data$spread < -thresh] <- "Below"
ggplot(data, aes(x = date, y = spread, shape = outlier, group = 1)) +
geom_line() +
geom_point() +
geom_hline(yintercept = c(thresh, -thresh), color = "red") +
scale_shape_manual(values = c(17,16))
Alternatively, you can just add the points above and below the threshold as individual layers with manually specified shapes, like this. The pch argument points to shape type.
# Another way of doing this
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
# Upper and lower threshold
thresh <- 4
ggplot(data, aes(x = date, y = spread, group = 1)) +
geom_line() +
geom_point(data = data[data$spread>thresh,], pch = 17) +
geom_point(data = data[data$spread< (-thresh),], pch = 16) +
geom_hline(yintercept = c(thresh, -thresh), color = "red")
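The conditional variable from the first approach can also be built in one step with base R's cut(). This is a sketch with simulated data, not the original price series. Note one difference: cut() uses right-closed intervals by default, so a value exactly equal to -thresh would be labelled "Below" here. Naming the values in scale_shape_manual ties each shape to its label regardless of level order.

```r
library(ggplot2)

set.seed(1)
# Simulated spread series (made-up values), same layout as the sample data
data <- data.frame(date = 2001:2020, spread = runif(20, -10, 10))
thresh <- 4

# Classify every observation in one call instead of three assignments
data$outlier <- cut(data$spread,
                    breaks = c(-Inf, -thresh, thresh, Inf),
                    labels = c("Below", "In Range", "Above"))

p <- ggplot(data, aes(x = date, y = spread)) +
  geom_line() +
  geom_point(aes(shape = outlier), size = 3) +
  geom_hline(yintercept = c(thresh, -thresh), color = "red") +
  # Named vector: triangle above, circle below, square in between
  scale_shape_manual(values = c("Above" = 17, "Below" = 16, "In Range" = 15))
```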
I am plotting a box-plot to see the distribution of the variable. I am also interested in seeing the number of observations in each quartile. Is there any way to add the number of observations in each quartile to the boxplot along with the values of quartiles?
I included some code below which can generate box-plot with the values of quartiles.
df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
I expect the values of quartiles on the left-hand side of the plot and the number of observations on the right-hand side of the plot if possible.
This would be one possibility. I always prefer to keep my additional data in a separate data frame, because this gives me more control over what is calculated and how.
Counting made with some inspiration from https://stackoverflow.com/a/54451575
quantile_counts=function(x){
df= data.frame(label=table(cut(x, quantile(x))),
label_pos=diff(quantile(x))/2+quantile(x)[1:4])
return(df)
}
df_quantile_counts=quantile_counts(df$Sepal.Length)
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
geom_text(data=df_quantile_counts,aes(x="",y=label_pos,label = label.Freq),
position = position_nudge(x = +0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
HTH, Tobi
@TobiO's answer is correct. However, my data was skewed and some cut points were the same (for example, the first and second cut points), so I needed to take the unique values to calculate the number of observations in each quartile. Another point concerns the cut function, which does not include the starting point: its intervals are (low bound, high bound]. To include the starting point, I used the cut2 function from the Hmisc package. I also included a label_pos_extension line to prevent the label/text from overlapping for quartiles whose cut points are very close to each other. geom_text_repel did not prevent the overlaps.
library(Hmisc)  # provides cut2()
quantile_counts2 <- function(x){
label_pos_extension <- c(0,3,4,0)
if(length(unique(quantile(x))) < 5){
df <- data.frame(label = table(cut2(x, g = 4)),
label_pos = c(0, diff(unique(quantile(x))) / 2 + quantile(x)[1:length(unique(quantile(x)))-1]) + label_pos_extension[1:length(unique(quantile(x)))])
} else {
df <- data.frame(label = table(cut2(x, g = 4)),
label_pos = diff(quantile(x)) / 2 + quantile(x)[1:4] + label_pos_extension)
}
return(df)
}
PS. I tried to put my edited function in a comment, but it did not work.
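The two issues mentioned above (the excluded lowest value and duplicated cut points) can be illustrated with base R alone. This is a sketch of an alternative, not what the answer uses: cut() has an include.lowest argument, and unique() removes tied breaks. Hmisc::cut2 may bin and label the groups differently, so the counts are not guaranteed to match it. The skewed vector is made up for illustration.

```r
x <- iris$Sepal.Length
breaks <- quantile(x)

# Default intervals are (low, high], so the minimum falls outside every bin
counts_default <- table(cut(x, breaks))

# include.lowest = TRUE closes the first interval: [low, high]
counts_closed <- table(cut(x, breaks, include.lowest = TRUE))

n_default <- sum(counts_default)  # fewer than length(x)
n_closed <- sum(counts_closed)    # equal to length(x)

# Skewed data with tied quantiles: cut() requires unique breaks
skewed <- c(rep(0, 60), 1:40)
counts_skewed <- table(cut(skewed, unique(quantile(skewed)), include.lowest = TRUE))
```

With tied quantiles there are fewer than four bins, which is why the label positions in the function above also have to be computed from the unique cut points.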