how to distinguish 4 factors in ggplot2? - r

How does one distinguish 4 different factors (not using size)? Is it possible to use hollow and solid points to distinguish a variable in ggplot2?
test=data.frame(x=runif(12,0,1),
y=runif(12,0,1),
siteloc=as.factor(c('a','b','a','b','a','b','a','b','a','b','a','b')),
modeltype=as.factor(c('q','r','s','q','r','s','q','r','s','q','r','s')),
mth=c('Mar','Apr','May','Mar','Apr','May','Mar','Apr','May','Mar','Apr','May'),
yr=c(2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011))
where x are observations and y are modeling results and I want to compare different model versions across several factors. Thanks!

I think it is very difficult to visually distinguish and compare x and y values across 4 factors. I would use faceting and reduce the number of factors, for example by combining two of them with interaction().
Here is an example using geom_bar:
set.seed(10)
library(reshape2)
library(ggplot2)
test.m <- melt(test,measure.vars=c('x','y'))
ggplot(test.m)+
geom_bar(aes(x=interaction(yr,mth),y=value,
fill=variable),stat='identity',position='dodge')+
facet_grid(modeltype~siteloc)
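To see what interaction() contributes here, a minimal base-R sketch may help. The yr and mth vectors below are shortened, illustrative stand-ins for the columns of test, not the real data. The call collapses two factors into one combined factor, which is what lets a single axis carry both:

```r
# Shortened stand-ins for test$yr and test$mth (illustrative values only)
yr  <- c(2010, 2011, 2010, 2011)
mth <- c("Mar", "Apr", "May", "Mar")

# interaction() joins the levels with "." and returns a single factor
combo <- interaction(yr, mth)
as.character(combo)

# drop = TRUE keeps only the combinations that actually occur in the data
combo_used <- interaction(yr, mth, drop = TRUE)
nlevels(combo)       # every possible yr x mth combination
nlevels(combo_used)  # only the observed combinations
```

Dropping unused combinations is worth considering when the combined factor goes on an axis, since empty levels otherwise take up space.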

I really like using interaction by agstudy - I would probably try this first. But if keeping things unchanged then:
4 factors could be accommodated with faceting and 2 axes. That leaves the 2 metrics x and y: one option is a bubble chart that distinguishes the two metrics by color, shape, or both (jitter added so the shapes overlap less):
testm = melt(test, id=c('siteloc', 'modeltype', 'mth', 'yr'))
# by color
ggplot(testm, aes(x=siteloc, y=modeltype, size=value, colour=variable)) +
geom_point(shape=21, position="jitter") +
facet_grid(mth~yr) +
scale_size_area(max_size=40) +
scale_shape(solid=FALSE) +
theme_bw()
#by shape
testm$shape = as.factor(with(testm, ifelse(variable=='x', 21, 25)))
ggplot(testm, aes(x=siteloc, y=modeltype, size=value, shape=shape)) +
geom_point(position="jitter") +
facet_grid(mth~yr) +
scale_size_area(max_size=40) +
scale_shape(solid=FALSE) +
theme_bw()
# by shape and color
ggplot(testm, aes(x=siteloc, y=modeltype, size=value, colour=variable, shape=shape)) +
geom_point(position="jitter") +
facet_grid(mth~yr) +
scale_size_area(max_size=40) +
scale_shape(solid=FALSE) +
theme_bw()
UPDATE:
This is an attempt, based on the first comment by Dominik, to show whether (x,y) falls above or below the 1:1 line and how large the ratio x/y or y/x is. A blue triangle means x/y>1, a red circle otherwise (melt is not needed in this case):
test$shape = as.factor(with(test, ifelse(x/y>1, 25, 21)))
test$ratio = with(test, ifelse(x/y>1, x/y, y/x))
ggplot(test, aes(x=siteloc, y=modeltype, size=ratio, colour=shape, shape=shape)) +
geom_point() +
facet_grid(mth~yr) +
scale_size_area(max_size=40) +
scale_shape(solid=FALSE) +
theme_bw()

You can use hollow and solid points, but only with certain shapes as described in this answer.
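As a rough illustration of which shapes behave which way, the sketch below draws the standard point shape codes 0 to 25 grouped by how they take colour. The data frame shapes and its kind column are made up for this example. Codes 21 to 25 are the ones this thread relies on for a separate fill:

```r
library(ggplot2)

# One row per point shape code, labelled by how the shape is coloured.
# `shapes` and `kind` are illustrative names, not part of the question's data.
shapes <- data.frame(code = 0:25)
shapes$kind <- cut(shapes$code, breaks = c(-1, 14, 20, 25),
                   labels = c("open or line (colour only)",
                              "solid (colour only)",
                              "filled (colour = border, fill = interior)"))

# scale_shape_identity() uses the numeric code itself as the shape
p <- ggplot(shapes, aes(x = code, y = kind)) +
  geom_point(aes(shape = code), size = 5, colour = "black", fill = "orange") +
  scale_shape_identity() +
  theme_bw()
p
```

Only the last group should show the orange interior; the other shapes ignore fill and are drawn in the colour alone.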
So, that leaves you with fill, colour, shape, and alpha as your aesthetic mappings. It looks ugly, but here it is:
ggplot(test, aes(x, y,
fill=modeltype,
shape=siteloc,
colour=mth,
alpha=factor(yr)
)) +
geom_point(size = 4) +
scale_shape_manual(values=21:25) +
scale_alpha_manual(values=c(0.35,1))
Ugly, but I guess it is what you asked for. (I haven't bothered to figure out what is happening with the legend -- it obviously isn't displaying the borders right.)
If you want to map a variable to a kind of custom aesthetic (hollow and solid), you'll have to go a little further:
test$fill.type<-ifelse(test$yr==2010,'other',as.character(test$mth))
cols<-c('red','green','blue')
ggplot(test, aes(x, y,
shape=modeltype,
alpha=siteloc,
colour=mth,
fill=fill.type
)) +
geom_point(size = 10) +
scale_shape_manual(values=21:25) +
scale_alpha_manual(values=c(1,0.5)) +
scale_colour_manual(values=cols) +
scale_fill_manual(values=c(cols,NA))
Still ugly, but it works. I don't know a cleaner way of mapping yr to one colour if it is 2010 and mth otherwise; I'd be happy if someone showed me one. And now the guides (legends) are totally wrong, but you can fix that manually.
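One possibly cleaner route to hollow versus solid, offered as a sketch rather than a tested fix for the original data, is to pair each hollow shape with its solid counterpart and pick between them with a combined factor. The column shape_key is a hypothetical helper, and the data below only mimics the structure of the question's test data frame:

```r
library(ggplot2)

set.seed(10)
# Same structure as the question's `test` data (values are random)
test <- data.frame(
  x = runif(12), y = runif(12),
  siteloc   = factor(rep(c("a", "b"), 6)),
  modeltype = factor(rep(c("q", "r", "s"), 4)),
  mth       = rep(c("Mar", "Apr", "May"), 4),
  yr        = rep(c(2010, 2011), 6)
)

# shape_key (hypothetical helper column): modeltype x yr, six levels
test$shape_key <- interaction(test$modeltype, test$yr)

p <- ggplot(test, aes(x, y, shape = shape_key, colour = mth, alpha = siteloc)) +
  geom_point(size = 5) +
  # 2010 gets hollow square/circle/triangle, 2011 the solid versions
  scale_shape_manual(values = c("q.2010" = 0,  "r.2010" = 1,  "s.2010" = 2,
                                "q.2011" = 15, "r.2011" = 16, "s.2011" = 17)) +
  scale_alpha_manual(values = c(a = 1, b = 0.5)) +
  theme_bw()
p
```

Because shape alone now encodes both modeltype and yr, fill is not needed and the legend keys match the plotted points.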

Related

Assigning 40 shapes or more in scale_shape_manual

I have a data frame with more than 40 factor levels and I would like to assign different shapes for each level. However, as shown in the scale_shapes_table of ggplot2, I can assign only 26 of them and some !,",# and so on.
But I know that in python or jmp you can assign many shapes (like asterisks, left triangle, right triangle, rectangle etc.). Is this also possible in ggplot2?
data=data.frame(gr=seq(1,40), x1=runif(40), y1=runif(40))
library(ggplot2)
ggplot(data=data,aes(x=x1,y=y1,shape=factor(gr),col=factor(gr)))+
geom_point(alpha = 0.3,size=4,stroke=1.4) +
scale_shape_manual(values=c(0:40))
A large set of symbols is available through the emojifont package with Font Awesome (see the complete list here). More details are given here.
library(ggplot2)
library(emojifont)
set.seed(1234)
symbls <- c('fa-github', 'fa-binoculars', 'fa-twitter', 'fa-android', 'fa-coffee',
'fa-cube', 'fa-ambulance','fa-check','fa-cutlery','fa-cogs','fa-dot-circle-o','fa-car',
'fa-building','fa-fire', 'fa-flag','fa-female','fa-gratipay','fa-heart','fa-magnet',
'fa-lock','fa-map','fa-puzzle-piece','fa-shopping-cart','fa-star','fa-sticky-note',
'fa-stop-circle-o','fa-volume-down','fa-anchor', 'fa-beer','fa-book','fa-cloud',
'fa-comment','fa-eject','fa-chrome','fa-child','fa-bomb', 'fa-certificate',
'fa-desktop','fa-fire-extinguisher','fa-diamond')
idx <- order(symbls)
fa <- fontawesome(symbls)
k <- length(fa)
data=data.frame(gr=factor(fa, levels=fa[idx]), x1=runif(k), y1=runif(k))
data$gr <- factor(data$gr, levels=fa[idx])
ggplot(data, aes(x1, y1, colour=gr, label=gr)) +
xlab(NULL) + ylab(NULL) + geom_point(size=-1) +
geom_text(family='fontawesome-webfont', size=6, show.legend=FALSE) +
theme(legend.text=element_text(family='fontawesome-webfont')) +
scale_colour_discrete("Points",guide=guide_legend(override.aes=list(size=4)))
Warning: if you want to use the code in RStudio, first reassign the graphics device as follows:
devtools::install_github("coatless/balamuta")
library("balamuta")
external_graphs()
Would a combination of 5 or 10 distinct shapes with distinct colors be sufficient to distinguish the 40 points? I find these visually easier to tell apart than unusual symbols.
ggplot(data=data,aes(x=x1,y=y1, shape=factor(gr), col=factor(gr)))+
geom_point(alpha = 0.5, size=4, stroke=1.4) +
scale_shape_manual(values=rep(c(0:2,5:6,9:10,11:12,14), times=4))
Or take advantage of the 5 unique shapes that take fill colors.
ggplot(data=data,aes(x=x1,y=y1, shape=factor(gr), fill=factor(gr), col=factor(gr)))+
geom_point(alpha = 0.5, size=4, stroke=1.4) +
scale_shape_manual(values=rep(c(21:25), times=8))
Maybe use gr as labels via ggrepel; it is easier to find a number than to compare shapes:
library(ggrepel)
ggplot(data = data, aes(x = x1, y = y1, label = gr))+
geom_point() +
geom_label_repel()

`fill` scale is not shown in the legend

Here is my dummy code:
set.seed(1)
df <- data.frame(xx=sample(10,6),
yy=sample(10,6),
type2=c('a','b','a','a','b','b'),
type3=c('A','C','B','A','B','C')
)
ggplot(data=df, mapping = aes(x=xx, y=yy)) +
geom_point(aes(shape=type3, fill=type2), size=5) +
scale_shape_manual(values=c(24,25,21)) +
scale_fill_manual(values=c('green', 'red'))
The resulting plot has a legend, but its 'type2' section doesn't reflect the fill scale. Is this by design?
I know this is an old thread, but I ran into this exact problem and want to post this here for others like me. While the accepted answer works, the less risky, cleaner method is:
library(ggplot2)
ggplot(data=df, mapping = aes(x=xx, y=yy)) +
geom_point(aes(shape=type3, fill=type2), size=5) +
scale_shape_manual(values=c(24,25,21)) +
scale_fill_manual(values=c(a='green',b='red'))+
guides(fill=guide_legend(override.aes=list(shape=21)))
The key is to change the shape in the legend to one of those that can have a 'fill'.
Here's a different workaround.
library(ggplot2)
ggplot(data=df, mapping = aes(x=xx, y=yy)) +
geom_point(aes(shape=type3, fill=type2), size=5) +
scale_shape_manual(values=c(24,25,21)) +
scale_fill_manual(values=c(a='green',b='red'))+
guides(fill=guide_legend(override.aes=list(colour=c(a="green",b="red"))))
Using guide_legend(...) with override.aes is a way to influence the appearance of the guide (the legend). The hack is that here we are "overriding" the colours of the legend keys with the fill colours they should have had in the first place.
I played with the data and came up with this idea. I first assigned shape in the first geom_point. Then, I made the shapes empty. In this way, outlines stayed in black colour. Third, I manually assigned specific shape. Finally, I filled in the symbols.
ggplot(data=df, aes(x=xx, y=yy)) +
geom_point(aes(shape = type3), size = 5.1) + # Plot with three types of shape first
scale_shape(solid = FALSE) + # Make the shapes empty
scale_shape_manual(values=c(24,25,21)) + # Assign specific types of shape
geom_point(aes(color = type2, fill = type2, shape = type3), size = 4.5)
I'm not sure if what you want looks like this?
ggplot(df,aes(x=xx,y=yy))+
geom_point(aes(shape=type3,color=type2,fill=type2),size=5)+
scale_shape_manual(values=c(24,25,21))

Customising legend size-symbol items in ggplot2

I'm mapping size to a variable with something like a log distribution - mostly small values but a few very large ones. How can I make the legend display custom values in the low-value range? For example:
df = data.frame(x=rnorm(2000), y=rnorm(2000), v=abs(rnorm(2000)^5))
p = ggplot(df, aes(x, y)) +
geom_point(aes(col=v, size=v), alpha=0.75) +
scale_size_area(max_size = 10)
print(p)
I've tried the p + guides(shape=guide_legend(override.aes=list(size=8))) solution posted in this SO question, but it makes no difference in my plot. In any case I'd like to use specific legend size values, e.g. v = c(10,25,50,100,250,500), instead of the default range, e.g. c(100,200,300,400).
Grateful for assistance.
To get different break points of size in legend, modify scale_size_area() by adding argument breaks=. With breaks= you can set breakpoints at positions you need.
ggplot(df, aes(x, y)) +
geom_point(aes(col=v, size=v), alpha=0.75) +
scale_size_area(max_size = 10,breaks=c(10,25,50,100,250,500))

Changing the color in the legend with ggplot2 in R

I'm having two different problems with specifying the colors in my legends in ggplot. I've tried to make a simplified examples that shows my problem:
library(plyr) # provides aaply()
library(ggplot2)
df <- data.frame(x=rep(1:9, 10),
y=as.vector(t(aaply(1:10, 1, .fun=function(x){x:(x+8)}))),
method=factor(rep(1:9, each=10)),
DE=factor(rep(1:9, each=10)))
ggplot(df, aes(x, y, color=method, group=DE, linetype=DE)) +
geom_smooth(stat="identity")
For some reason, the line types shown in the legend under the title DE are all blue. I'd like them to be black, but I have no idea why they're blue in the first place, so I'm not sure how to change them.
For my other problem, I'm trying to use both point color and point shape to show two different distinctions in my data. I'd like to have legends for both of these. Here's what I have:
classifiers <- c("KNN", "RF", "NB", "LR", "Tree")
des <- c("Uniform", "Gaussian", "KDE")
withoutDE <- c(.735, .710, .706, .628, .614, .720, .713, .532, .523, .557, .677, .641, .398, .507, .538)
withDE <- c(.769, .762, .758, .702, .707, .752, .745, .655, .721, .733, .775, .772, .749, .756, .759)
df <- data.frame(WithoutDE=withoutDE, WithDE=withDE, DE=rep(des, each=5), Classifier=rep(classifiers, 3))
df <- cbind(df, Method=paste(df$DE, df$Classifier, sep=""))
ggplot() +
geom_point(data=df, aes(x=WithoutDE, y=WithDE, shape=Classifier, fill=DE), size=3) +
ylim(0,1) + xlim(0,1) +
xlab("AUC without DE") + ylab("AUC with DE") +
scale_shape_manual(values=21:25) +
scale_fill_manual(values=c("pink", "blue", "white"), labels=c("Uniform", "KDE", "Gaussian")) +
theme(legend.position=c(.85,.3))
If I map the color as well as the fill (by putting color=DE into the aes), then the colors are visible in the legend. I like having the black border around the points, though. I'd just like the inside of the points in the legend to reflect the point fill in the plot. (I'd also like to position the two legends side-by-side, but I really just want to get the color to work right now.)
I've spent way too long googling about both of these problems and trying various solutions without any success. Does anyone have any idea what I'm doing wrong?
For question 1:
Give the legend for line type and the legend for colour the same name.
ggplot(df, aes(x, y, color=method, group=DE, linetype=DE)) +
geom_smooth(stat="identity") +
scale_color_discrete("Line") +
scale_linetype_discrete("Line")
For question 2:
I do not think your fills match your data. You should assign the name of the value to each colour in the scale_*_manual calls (here scale_fill_manual and scale_colour_manual).
I couldn't get the black border for the points. Here is what I was able to get, though:
ggplot() +
geom_point(data=df, aes(x=WithoutDE, y=WithDE, shape=Classifier,
fill=DE, colour=DE), size=3) +
ylim(0,1) + xlim(0,1) +
xlab("AUC without DE") +
ylab("AUC with DE") +
scale_shape_manual(values=21:25) +
scale_fill_manual(values=c("Uniform"="pink", "KDE"="blue", "Gaussian"="white"),
guide="none") +
scale_colour_manual(values=c("Uniform"="pink", "KDE"="blue", "Gaussian"="white"),
labels=c("Uniform", "KDE", "Gaussian")) +
theme(legend.position=c(.85,.3))
I don't know if you can control the point type inside the legends. Maybe someone else with more knowledge of ggplot2 can figure it out.
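One possible way to keep the black border, sketched here by borrowing the override.aes approach from the fill-legend thread above and not verified against the asker's exact setup: leave colour unmapped so the borders stay black, and ask the fill legend to draw its keys with a fillable shape. The data is the question's; the choice of shape 21 for the legend keys is an assumption.

```r
library(ggplot2)

# Data from the question
classifiers <- c("KNN", "RF", "NB", "LR", "Tree")
des <- c("Uniform", "Gaussian", "KDE")
df <- data.frame(
  WithoutDE = c(.735, .710, .706, .628, .614, .720, .713, .532,
                .523, .557, .677, .641, .398, .507, .538),
  WithDE = c(.769, .762, .758, .702, .707, .752, .745, .655,
             .721, .733, .775, .772, .749, .756, .759),
  DE = rep(des, each = 5),
  Classifier = rep(classifiers, 3)
)

p <- ggplot(df, aes(x = WithoutDE, y = WithDE)) +
  # colour is not mapped, so every point keeps the default black border
  geom_point(aes(shape = Classifier, fill = DE), size = 3) +
  scale_shape_manual(values = 21:25) +
  # named values tie each colour to its level regardless of level order
  scale_fill_manual(values = c(Uniform = "pink", KDE = "blue", Gaussian = "white")) +
  # draw the fill legend keys with shape 21, which has a fillable interior
  guides(fill = guide_legend(override.aes = list(shape = 21))) +
  xlim(0, 1) + ylim(0, 1) +
  xlab("AUC without DE") + ylab("AUC with DE")
p
```

The shape legend keys remain unfilled here, because that legend describes shape only.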

Splitting distribution visualisations on the y-axis in ggplot2 in r

The most commonly cited example of how to visualize a logistic fit using ggplot2 seems to be something very much like this:
data("kyphosis", package="rpart")
ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
geom_point() +
# ggplot2 >= 2.0.0 passes model arguments such as family through method.args
stat_smooth(method="glm", method.args=list(family="binomial"))
This visualisation works great if you don't have too much overlapping data, and the first suggestion for crowded data seems to be to use injected jitter in the x and y coordinates of the points then adjust the alpha value of the points. When you get to the point where individual points aren't useful but distributions of points are, is it possible to use geom_density(), geom_histogram(), or something else to visualise the data but continue to split the categorical variable along the y-axis as it is done with geom_point()?
From what I have found, geom_density() and geom_histogram() can easily be split/grouped by the categorical variable and both levels can easily be reversed using scale_y_reverse() but I can't figure out if it is even possible to move only one of the categorical variable distributions to the top of the plot. Any help/suggestions would be appreciated.
The annotate() function in ggplot allows you to add geoms to a plot with properties that "are not mapped from variables of a data frame, but are instead passed in as vectors," meaning that you can add layers that are unrelated to your data frame. In this case your two density curves are related to the data frame (since the variables are in it), but because you're trying to position them differently, using annotate() is useful.
Here's one way to go about it:
data("kyphosis", package="rpart")
model.only <- ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
stat_smooth(method="glm", method.args=list(family="binomial")) # family goes in method.args in ggplot2 >= 2.0.0
absents <- subset(kyphosis, Kyphosis=="absent")
presents <- subset(kyphosis, Kyphosis=="present")
dens.absents <- density(absents$Age)
dens.presents <- density(presents$Age)
scaling.factor <- 10 # Make the density plots taller
model.only + annotate("line", x=dens.absents$x, y=dens.absents$y*scaling.factor) +
annotate("line", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1)
This adds two annotated layers with scaled density plots for each of the kyphosis groups. For the presents variable, y is scaled and increased by 1 to shift it up.
You can also fill the density plots instead of just using a line. Instead of annotate("line"...) you need to use annotate("polygon"...), like so:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green", colour="black", alpha=0.4)
Technically you could use annotate("density"...), but that won't work when you shift the present plot up by one. Instead of shifting, it fills the whole plot:
model.only + annotate("density", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red") +
annotate("density", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green")
The only way around that problem is to use a polygon instead of a density geom.
One final variant: flipping the top density plot along y-axis = 1:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=(1 - dens.presents$y*scaling.factor), fill="green", colour="black", alpha=0.4)
I am not sure I get your point, but here is an attempt:
dat <- rbind(kyphosis,kyphosis)
dat$grp <- factor(rep(c('smooth','dens'),each = nrow(kyphosis)),
levels = c('smooth','dens'))
ggplot(dat,aes(x=Age)) +
facet_grid(grp~.,scales = "free_y") +
#geom_point(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1)) +
stat_smooth(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1),
method="glm", method.args=list(family="binomial")) +
geom_density(data=subset(dat,grp=='dens'))
