How to plot filled points and confidence ellipses with the same color using ggplot in R?

I would like to plot a graph from a Discriminant Function Analysis in which points must have a black border and be filled with specific colors and confidence ellipses must be the same color as the points are filled. Using the following code, I get almost the graph I want, except that points do not have a black border:
library(ggplot2)
library(ggord)
library(MASS)
data("iris")
set.seed(123)
linear <- lda(Species~., iris)
linear
dfaplot <- ggord(linear, iris$Species, labcol = "transparent", arrow = NULL, poly = FALSE, ylim = c(-11, 11), xlim = c(-11, 11))
dfaplot +
scale_shape_manual(values = c(16,15,17)) +
scale_color_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
theme(legend.position = "none")
PLOT 1
I could put a black border on the points by using the following code, but then confidence ellipses turn black.
dfaplot +
scale_shape_manual(values = c(21,22,24)) +
scale_color_manual(values = c("black","black","black")) +
scale_fill_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
theme(legend.position = "none")
PLOT 2
I would like to keep the ellipses as in the first graph, but the points as in the second one. However, I have been unable to figure out how to do this. If anyone has suggestions, I would be very grateful. I am using the "ggord" package because I learned how to run the analysis with it, but a solution that uses only ggplot would be fine too.

This roughly replicates what ggord does. Looking at the package source, ggord implements its ellipses differently than the code below, hence the small differences. If that is a big deal, you can review the source and make changes. The default point shape in geom_point ignores the fill aesthetic. So we switch to shapes 21 to 25, which have both a border (color) and an interior (fill), map fill to species, and specify color = 'black' in geom_point(). The full code (including projecting the original data) is below.
library(MASS)
library(ggplot2)
set.seed(123)
linear <- lda(Species~., iris)
linear
# Get point x, y coordinates
df <- data.frame(predict(linear, iris[, 1:4]))
df$species <- iris$Species
# Get explained variance for each axis
var_exp <- 100 * linear$svd ^ 2 / sum(linear$svd ^ 2)
ggplot(data = df,
aes(x = x.LD1,
y = x.LD2)) +
geom_point(aes(fill = species,
shape = species),
color = "black", # black border; the fill comes from the fill scale
size = 4) +
stat_ellipse(aes(color = species),
level = 0.95) +
ylim(c(-11, 11)) +
xlim(c(-11, 11)) +
ylab(paste("LD2 (",
round(var_exp[2], 2),
"%)")) +
xlab(paste("LD1 (",
round(var_exp[1], 2),
"%)")) +
scale_color_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
scale_fill_manual(values = c("#00FF00","#FF00FF","#0000FF")) +
scale_shape_manual(values = c(21, 22, 24)) +
coord_fixed(1) +
theme_bw() +
theme(
legend.position = "none"
)
To plot arrows, you can grab the scaling from the lda output and plot it with geom_segment. I played with the colors/alpha so they were visible in the plot below.
scaling <- data.frame(linear$scaling)
...
geom_segment(data = scaling,
aes(x = 0,
y = 0,
xend = LD1,
yend = LD2),
arrow = arrow(),
color = "black") +
geom_text(data = scaling,
aes(x = ifelse(LD1 <= 0.1, LD1 - 2, LD1 + 2),
y = ifelse(LD2 <= 0.1, LD2 - 1, LD2 + 1)),
label = rownames(scaling),
color = "black") +
...
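The arrow fragment above can be combined with the point and ellipse layers into one script. The following is only a sketch under some assumptions: the names scores and variable are introduced here for illustration, the arrow length and label offset are arbitrary, and the result is not meant to match ggord's output exactly.

```r
library(MASS)
library(ggplot2)

linear <- lda(Species ~ ., iris)

# Discriminant scores (columns LD1, LD2) plus the grouping variable
scores <- data.frame(predict(linear, iris[, 1:4])$x, species = iris$Species)

# Variable loadings; keep the row names as a column for labelling
scaling <- data.frame(linear$scaling)
scaling$variable <- rownames(scaling)

p <- ggplot(scores, aes(x = LD1, y = LD2)) +
  # shapes 21, 22, 24 take a border colour and a fill
  geom_point(aes(fill = species, shape = species), color = "black", size = 3) +
  stat_ellipse(aes(color = species), level = 0.95) +
  geom_segment(data = scaling,
               aes(x = 0, y = 0, xend = LD1, yend = LD2),
               arrow = arrow(length = unit(0.2, "cm")), color = "black") +
  geom_text(data = scaling,
            aes(x = LD1, y = LD2, label = variable),
            color = "black", vjust = -0.5) +
  scale_color_manual(values = c("#00FF00", "#FF00FF", "#0000FF")) +
  scale_fill_manual(values = c("#00FF00", "#FF00FF", "#0000FF")) +
  scale_shape_manual(values = c(21, 22, 24)) +
  coord_fixed(1) +
  theme_bw() +
  theme(legend.position = "none")

print(p)
```

Passing data = scaling to the arrow and label layers keeps them independent of the per-observation data used for the points.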

Related

ggplot2: Projecting points or distribution on a non-orthogonal (eg, -45 degree) axis

The figure below is a conceptual diagram used by Michael Clark,
https://m-clark.github.io/docs/lord/index.html
to explain Lord's Paradox and related phenomena in regression.
My question is framed in this context and using ggplot2 but it is broader in terms of geometry & graphing.
I would like to reproduce figures like this, but using actual data. I need to know:
how to draw a new axis at the origin, with a -45 degree angle, corresponding to values of y-x
how to draw little normal distributions or density diagrams, or other representations of the values y-x projected onto this axis.
My minimal base example uses ggplot2,
library(ggplot2)
set.seed(1234)
N <- 200
group <- rep(c(0, 1), each = N/2)
initial <- .75*group + rnorm(N, sd=.25)
final <- .4*initial + .5*group + rnorm(N, sd=.1)
change <- final - initial
df <- data.frame(id = factor(1:N),
group = factor(group,
labels = c('Female', 'Male')),
initial,
final,
change)
#head(df)
#' plot, with regression lines and data ellipses
ggplot(df, aes(x = initial, y = final, color = group)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
stat_ellipse(size = 1.2) +
geom_abline(slope = 1, color = "black", size = 1.2) +
coord_fixed(xlim = c(-.6, 1.2), ylim = c(-.6, 1.2)) +
theme_bw() +
theme(legend.position = c(.15, .85))
This gives the following graph:
In geometry, the coordinates of the -45 degree rotated axes of distributions I want to portray are
(y-x), (x+y) in the original space of the plot. But how can I draw these with
ggplot2 or other software?
An accepted solution can be vague about how the distribution of (y-x) is represented,
but should solve the problem of how to display this on a (y-x) axis.
Fun question! I haven't encountered it yet, but there might be a package to help do this automatically. Here's a manual approach using two hacks:
the clip = "off" parameter of the coord_* functions, to allow us to add annotations outside the plot area.
building a density plot, extracting its coordinates, and then rotating and translating those.
First, we can make a density plot of the change from initial to final, seeing a left skewed distribution:
library(dplyr) # provides %>%, mutate() and tibble()
(my_hist <- df %>%
mutate(gain = final - initial) %>% # gain would be better name
ggplot(aes(gain)) +
geom_density())
Now we can extract the guts of that plot, and transform the coordinates to where we want them to appear in the combined plot:
a <- ggplot_build(my_hist)
rot = pi * 3/4
diag_hist <- tibble(
x = a[["data"]][[1]][["x"]],
y = a[["data"]][[1]][["y"]]
) %>%
# squish
mutate(y = y*0.2) %>%
# rotate 135 deg CCW
mutate(xy = x*cos(rot) - y*sin(rot),
dens = x*sin(rot) + y*cos(rot)) %>%
# slide
mutate(xy = xy - 0.7, # magic number based on plot range below
dens = dens - 0.7)
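To see what the rotate and slide steps do, here is a small sketch that applies the same rotation formula to three points on the density plot's baseline (gain = -1, 0, 1 at height 0). The helper name rotate_xy is made up for this illustration.

```r
# Rotate points counter-clockwise about the origin by `angle` (radians)
rotate_xy <- function(x, y, angle) {
  data.frame(x = x * cos(angle) - y * sin(angle),
             y = x * sin(angle) + y * cos(angle))
}

rot <- pi * 3 / 4  # 135 degrees, as in the answer

# Three baseline points of the density plot: gain = -1, 0, 1 at height 0
baseline <- rotate_xy(c(-1, 0, 1), 0, rot)

# Slide by the same offset as above
baseline$x <- baseline$x - 0.7
baseline$y <- baseline$y - 0.7

# Every slid baseline point satisfies x + y = -1.4,
# i.e. it lies on the line through (-1.4, 0) and (0, -1.4)
baseline$x + baseline$y
```

Positive gains end up toward the upper left, where final exceeds initial, and the density height points down and to the left, away from the panel. Note that a gain g is placed at distance g along the rotated axis, whereas the perpendicular projection of a point onto that axis has length (y - x)/sqrt(2). If the curve should line up with those projections, one option is to divide gain by sqrt(2) before building the density.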
And here's a combination with the original plot:
ggplot(df, aes(x = initial, y = final, color = group)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
stat_ellipse(size = 1.2) +
geom_abline(slope = 1, color = "black", size = 1.2) +
coord_fixed(clip = "off",
xlim = c(-0.7,1.6),
ylim = c(-0.7,1.6),
expand = FALSE) + # coord_* functions take a logical here, not expansion()
annotate("segment", x = -1.4, xend = 0, y = 0, yend = -1.4) +
annotate("path", x = diag_hist$xy, y = diag_hist$dens) +
theme_bw() +
theme(legend.position = c(.15, .85),
plot.margin = unit(c(.1,.1,2,2), "cm"))

R: Changing the Color of Overlapping Points

I am working with the R programming language. I made the following graph that shows a scatterplot between points of two different colors :
library(ggplot2)
a = rnorm(10000,10,10)
b = rnorm(10000, 10, 10)
c = as.factor("red")
data_1 = data.frame(a,b,c)
a = rnorm(10000,7,5)
b = rnorm(10000, 7, 5)
c = as.factor("blue")
data_2 = data.frame(a,b,c)
final = rbind(data_1, data_2)
my_plot = ggplot(final, aes(x=a, y=b, col = c)) + geom_point() + theme(legend.position="top") + ggtitle("My Plot")
My Question: Is there a way to "change the colors of overlapping points"?
Here is what I tried so far:
1) I found the following question (Visualizing two or more data points where they overlap (ggplot R)) and tried the strategy suggested:
linecolors <- c("#714C02", "#01587A", "#024E37")
fillcolors <- c("#9D6C06", "#077DAA", "#026D4E")
# partially transparent points by setting `alpha = 0.5`
ggplot(final, aes(a,b, colour = c, fill = c)) +
geom_point(alpha = 0.5) +
scale_color_manual(values=linecolors) +
scale_fill_manual(values=fillcolors) +
theme_bw()
This shows the two different colors along with the overlap, but it is quite dark and still not clear. Is there a way to pick better colors/resolutions for this?
2) I found the following link which shows how to make color gradients for continuous variables : https://drsimonj.svbtle.com/pretty-scatter-plots-with-ggplot2 - but I have discrete colors and I do not know how to apply this
3) I found this question over here (Any way to make plot points in scatterplot more transparent in R?) which shows to do this with the base R plot, but not with ggplot2:
addTrans <- function(color,trans)
{
# This function adds transparency to a color.
# Define transparency with an integer between 0 and 255
# 0 being fully transparent and 255 being fully visible
# Works with either color and trans a vector of equal length,
# or one of the two of length 1.
if (length(color)!=length(trans)&!any(c(length(color),length(trans))==1)) stop("Vector lengths not correct")
if (length(color)==1 & length(trans)>1) color <- rep(color,length(trans))
if (length(trans)==1 & length(color)>1) trans <- rep(trans,length(color))
num2hex <- function(x)
{
hex <- unlist(strsplit("0123456789ABCDEF",split=""))
return(paste(hex[(x-x%%16)/16+1],hex[x%%16+1],sep=""))
}
rgb <- rbind(col2rgb(color),trans)
res <- paste("#",apply(apply(rgb,2,num2hex),2,paste,collapse=""),sep="")
return(res)
}
cols <- sample(c("red","green","pink"),100,TRUE)
# Very transparant:
plot(final$a , final$b ,col=addTrans(cols,100),pch=16,cex=1)
But this is also not able to differentiate between the two color classes that I have.
Problem: Can someone please suggest how to fix the problem with overlapping points, such that the overlap appear more visible?
Thanks!
I would use a density heatmap
ggplot(final, aes(x=a, y=b, col = c))+
stat_density_2d(aes(fill = stat(density)), geom = 'raster', contour = FALSE) +
scale_fill_viridis_c() +
coord_cartesian(expand = FALSE) +
geom_point(shape = '.', col = 'white')
or
ggplot(final, aes(x=a, y=b, col = c))+
stat_density_2d(aes(fill = stat(level)), geom = 'polygon') +
scale_fill_viridis_c(name = "density") +
geom_point(shape = '.')
or
ggplot(final, aes(x=a, y=b, col = c))+
geom_point(alpha = 0.1) +
geom_rug(alpha = 0.01)
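When choosing alpha for the last variant, it can help to estimate how opaque a stack of overlapping points becomes. The sketch below assumes standard alpha compositing, where n points of opacity alpha drawn on top of each other have a combined opacity of 1 - (1 - alpha)^n. The function name stacked_opacity is hypothetical.

```r
# Combined opacity of n overlapping points, each drawn with opacity `alpha`
# (assumes the usual "over" compositing used by graphics devices)
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

# How quickly does alpha = 0.1 saturate compared with alpha = 0.01?
n_points <- c(1, 5, 10, 50)
data.frame(
  n = n_points,
  alpha_0.10 = round(stacked_opacity(0.10, n_points), 3),
  alpha_0.01 = round(stacked_opacity(0.01, n_points), 3)
)
```

With 10,000 points per group, dense regions saturate quickly at alpha = 0.1, so a smaller alpha or one of the density-based plots above may separate the two groups more clearly.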

Is there an equivalent to points() on ggplot2

I'm working with stock prices and trying to plot the price difference.
I created one using autoplot.zoo(). My question is: how can I change the point shapes to triangles when they are above the upper threshold and to circles when they are below the lower threshold? I understand that with the base plot() function you can do this by calling points(); I am wondering how to do the same with ggplot2.
Here is the code for the plot:
p<-autoplot.zoo(data, geom = "line")+
geom_hline(yintercept = threshold, color="red")+
geom_hline(yintercept = -threshold, color="red")+
ggtitle("AAPL vs. SPY out of sample")
p+geom_point()
We can't fully replicate without your data, but here's an attempt with some sample generated data that should be similar enough that you can adapt for your purposes.
# Sample data
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
# Upper and lower threshold
thresh <- 4
You can create an additional variable that determines the shape, based on the relationship in the data itself, and pass that as an argument into ggplot.
# Create conditional data
data$outlier[data$spread > thresh] <- "Above"
data$outlier[data$spread < -thresh] <- "Below"
data$outlier[is.na(data$outlier)] <- "In Range"
library(ggplot2)
ggplot(data, aes(x = date, y = spread, shape = outlier, group = 1)) +
geom_line() +
geom_point() +
geom_hline(yintercept = c(thresh, -thresh), color = "red") +
scale_shape_manual(values = c(17,16,15))
# If you want points just above and below
# Sample data
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
thresh <- 4
data$outlier[data$spread > thresh] <- "Above"
data$outlier[data$spread < -thresh] <- "Below"
ggplot(data, aes(x = date, y = spread, shape = outlier, group = 1)) +
geom_line() +
geom_point() +
geom_hline(yintercept = c(thresh, -thresh), color = "red") +
scale_shape_manual(values = c(17,16))
Alternatively, you can add the points above and below the threshold as separate layers with manually specified shapes, like this. The pch argument is an alias for shape.
# Another way of doing this
data = data.frame(date = c(2001:2020),
spread = runif(20, -10,10))
# Upper and lower threshold
thresh <- 4
ggplot(data, aes(x = date, y = spread, group = 1)) +
geom_line() +
geom_point(data = data[data$spread>thresh,], pch = 17) +
geom_point(data = data[data$spread< (-thresh),], pch = 16) +
geom_hline(yintercept = c(thresh, -thresh), color = "red") +
scale_shape_manual(values = c(17,16))
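The conditional column from the first approach can also be built in a single step, and the shapes can be given as a named vector so that they do not depend on the alphabetical order of the labels. This is a sketch on made-up data, not a drop-in replacement for the autoplot.zoo() call in the question.

```r
library(ggplot2)

set.seed(42)
data <- data.frame(date = 2001:2020,
                   spread = runif(20, -10, 10))
thresh <- 4

# Classify every observation relative to the two thresholds
data$outlier <- ifelse(data$spread > thresh, "Above",
                ifelse(data$spread < -thresh, "Below", "In Range"))

p <- ggplot(data, aes(x = date, y = spread)) +
  geom_line() +
  geom_point(aes(shape = outlier), size = 3) +
  geom_hline(yintercept = c(thresh, -thresh), color = "red") +
  # named values: triangle above, circle below, square in range
  scale_shape_manual(values = c("Above" = 17, "Below" = 16, "In Range" = 15))

print(p)
```

Mapping shape only inside geom_point() keeps the line as a single group without needing group = 1.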

row column heatmap plot with overlayed circle (fill and size) in r

Here is a graph I am trying to develop:
I have row and column coordinate variables, plus three quantitative variables (rectheat = fill of the rectangle heatmap, circlesize = size of the circles, circlefill = fill color of the circles). NA values should be represented by a different color (for example, gray).
The following is data:
set.seed (1234)
rectheat = sample(c(rnorm (10, 5,1), NA, NA), 7*14, replace = T)
dataf <- data.frame (rowv = rep (1:7, 14), columnv = rep(1:14, each = 7),
rectheat, circlesize = rectheat*1.5,
circlefill = rectheat*10 )
dataf
Here is code that I worked on:
require(ggplot2)
ggplot(dataf, aes(y = factor(rowv),x = factor(columnv))) +
geom_rect(aes(colour = rectheat)) +
geom_point(aes(colour = circlefill, size =circlesize)) + theme_bw()
I am not sure whether geom_rect is appropriate or whether the rest is fine, as I could not get any results except errors.
Here it is better to use geom_tile (heatmap).
require(ggplot2)
ggplot(dataf, aes(y = factor(rowv),
x = factor(columnv))) + ## global aes
geom_tile(aes(fill = rectheat)) + ## to get the rect filled
geom_point(aes(colour = circlefill,
size =circlesize)) + ## geom_point for circle illusion
scale_color_gradient(low = "yellow",
high = "red")+ ## color of the corresponding aes
scale_size(range = c(1, 20))+ ## to tune the size of circles
theme_bw()
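The question also asks for missing values to appear in a distinct color. One way to make that explicit is the na.value argument of the fill scale. The sketch below rebuilds the question's data and uses arbitrary colors. Circles with a missing size are simply not drawn.

```r
library(ggplot2)

set.seed(1234)
rectheat <- sample(c(rnorm(10, 5, 1), NA, NA), 7 * 14, replace = TRUE)
dataf <- data.frame(rowv = rep(1:7, 14),
                    columnv = rep(1:14, each = 7),
                    rectheat,
                    circlesize = rectheat * 1.5,
                    circlefill = rectheat * 10)

p <- ggplot(dataf, aes(x = factor(columnv), y = factor(rowv))) +
  geom_tile(aes(fill = rectheat), color = "white") +
  # na.rm = TRUE drops circles with missing size without a warning
  geom_point(aes(color = circlefill, size = circlesize), na.rm = TRUE) +
  # tiles with NA in rectheat are drawn in grey
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "grey80") +
  scale_color_gradient(low = "yellow", high = "red") +
  scale_size(range = c(1, 10)) +
  theme_bw()

print(p)
```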

Overlaying histograms with ggplot2 in R

I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data,
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
note that the default position of geom_histogram is "stack."
see "position adjustment" of this page:
geom_histogram documentation
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. Borders and coloring need to be used properly so that the eye can differentiate between histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing data frame in as before (and widening chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Result:
Since I set the means explicitly when building many_distros, I can simply pass them in. Alternatively, you can calculate them inside the function and use those values.
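Calculating the means inside the function could look roughly like this. It is a sketch, not the answer's original code: the function name plot_multi_histogram_auto is hypothetical, it uses the .data pronoun and after_stat() (ggplot2 3.3.0 or later) instead of eval(parse()) and ..density.., and bins = 30 is an arbitrary choice.

```r
library(ggplot2)

plot_multi_histogram_auto <- function(df, feature, label_column) {
  # One mean per category, computed from the data itself
  means <- aggregate(df[[feature]],
                     by = list(group = df[[label_column]]),
                     FUN = mean)
  names(means) <- c("group", "mean_value")

  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", color = "black", bins = 30) +
    geom_density(alpha = 0.7) +
    # one dashed line per category mean
    geom_vline(data = means, aes(xintercept = mean_value),
               color = "black", linetype = "dashed") +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_auto(iris, "Sepal.Width", "Species")
print(p)
```

With this version the caller no longer has to know the group means in advance, which is convenient for real data such as iris.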

Resources