Consistent variable colours in factoextra across plots

I am trying to keep my plots with consistent variable colours while using the factoextra library to plot PCA results. Reproducible example below:
library("FactoMineR")
library("factoextra")  # provides decathlon2, get_eig() and fviz_contrib()
data("decathlon2")
df <- decathlon2[1:23, 1:10]
res.pca <- PCA(df, graph = FALSE)
get_eig(res.pca)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
I would like the plots for PC1 and PC2 to share a colour palette with 10 colours that is identical across plots (i.e. x100m will be red in both). However, in my actual data set I have 15 explanatory variables, which seems to be above the limit for ColorBrewer palettes, so there are 2 problems:
How to maintain a consistent colour scheme
How to utilise 15 colours
Thank you in advance.

(I assume you already know you need to add fill = "name" to the fviz_contrib() call; otherwise the bars will default to fill = "steelblue".)
You can define the palette manually, such that each variable corresponds to the same colour.
To simulate the problem using the example in the question, suppose we only want to show the top 7, when there are 10 variables all together:
# naive way with 7-color palette applied to different variables
fviz_contrib(res.pca, choice = "var", fill = "name", color = "black", axes = 1, top = 7)
fviz_contrib(res.pca, choice = "var", fill = "name", color = "black", axes = 2, top = 7)
We can create a palette using hue_pal() from the scales package, for 10 different colours (one for each column of df).
(You can also use palettes such as rainbow() / heat.colors() / etc. from the base grDevices package. I find their default colour range to be rather intense, though, with a tendency to be overly glaring for a bar chart.)
mypalette <- scales::hue_pal()(ncol(df))
names(mypalette) <- colnames(df)
# optional: see what each color corresponds to
ggplot(data.frame(x = names(mypalette),
y = 1,
fill = mypalette)) +
geom_tile(aes(x = x, y = y, fill = fill), color = "black") +
scale_fill_identity() +
coord_equal()
Use scale_fill_manual() with the self-defined palette on each chart:
fviz_contrib(res.pca, choice = "var", fill = "name", color = "black", axes = 1, top = 7) +
scale_fill_manual(values = mypalette)
fviz_contrib(res.pca, choice = "var", fill = "name", color = "black", axes = 2, top = 7) +
scale_fill_manual(values = mypalette)
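For the 15-variable case, the same named-vector idea should carry over, because evenly spaced hues can be generated for any number of variables. Below is a minimal sketch using only base R. The variable names (var1 to var15) are hypothetical placeholders for your own column names, and the chroma and luminance values are my own choice, not something factoextra prescribes.

```r
# Hypothetical names standing in for 15 explanatory variables
vars <- paste0("var", 1:15)
n <- length(vars)

# n evenly spaced hues around the colour wheel (chroma and luminance fixed)
hues <- seq(15, 375, length.out = n + 1)[-(n + 1)]
pal15 <- grDevices::hcl(h = hues, c = 100, l = 65)
names(pal15) <- vars

# Looking colours up by name gives the same colour for a variable
# no matter which subset of variables ends up in a given plot
top_pc1 <- c("var3", "var7", "var1")   # hypothetical top contributors to PC1
top_pc2 <- c("var7", "var12", "var3")  # hypothetical top contributors to PC2
pal15[top_pc1]
pal15[top_pc2]
```

Passing pal15 to scale_fill_manual(values = pal15) on each fviz_contrib() plot should then behave like the 10-colour example, since named values are matched to the variable names.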

Related

Modify the number of rows in the legend of a plot

I'm generating a plot using ggnet and would like to modify the legend so I can increase the number of rows in the legend.
So far this is what I'm trying, but I get a fixed number of rows in the legend. How can I increase that number?
p <- ggnet2(net,
color = "Order",
palette = y,
alpha = 0.75,
size = 6,
edge.label = "weight",
edge.size=1,
edge.color="color",
edge.alpha = 0.5,
label = TRUE,
label.size = 4)
I have tried this without success:
p <- p + guides(fill=guide_legend(nrow=40,byrow=TRUE))

Colours across Plots / Heatmaps in R

I am creating a number of heatmaps in R, but I am having problems when it comes to keeping the colour scale consistent across graphs.
I find that the colours are scaled within a graph. Is there a way to make colours consistent across graphs, i.e. so that the colour difference between a value of 0.4 and 0.5 is always the same?
Code Example:
set.seed(123)
d1 = matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 = matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat = list(d1, d2)
for(m in mat)
heatmap(m, Rowv = NA ,Colv = NA)
You'll note in the example that cell (2,3) in the first graph looks similar to cell (1,3) in the second, despite the values being ~0.8 apart.

Here's a way to do it with ggplot2, if you're open to not using base graphics:
library(reshape2)
library(ggplot2)
# Set common limits for color scale
limits = range(unlist(mat))
Here's the code for two separate graphs. The last line of code for each graph ensures that they use the same z limits for setting the colors:
ggplot(melt(mat[[1]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
ggplot(melt(mat[[2]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
Another option is to plot both heatmaps in a single graph using facetting, which automatically ensures both graphs are on the same color scale:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1)
I've used the default colors here, but for either approach you can set the color scale to be anything you wish. For example:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1) +
scale_fill_gradient(low="red", high="green")

You could use the image function directly (heatmap uses image), though it will require some extra formatting to match the output of heatmap. You can use zlim to set the color range. Quoting from the ?image page:
the minimum and maximum z values for which colors should be plotted,
defaulting to the range of the finite values of z. Each of the given
colors will be used to color an equispaced interval of this range. The
midpoints of the intervals cover the range, so that values just
outside the range will be plotted.
# define zlim min and max for all the plots
minz = Reduce(min, mat)
maxz = Reduce(max, mat)
for(m in mat) {
image( m, zlim = c(minz, maxz), col = heat.colors(20))
}
To get closer to the formatting produced by heatmap, you can just reuse some code from the heatmap function:
for(m in mat) {
labCol = dim(m)[2]
labRow = dim(m)[1]
image(seq_len(labCol), seq_len(labRow), m, zlim = c(minz, maxz),
col = heat.colors(20), axes = FALSE, xlab = "", ylab = "",
xlim = 0.5 + c(0, labCol), ylim = 0.5 + c(0, labRow))
axis(1, 1L:labCol, labels = seq_len(labCol), las = 2, line = -0.5, tick = 0)
axis(4, 1L:labRow, labels = seq_len(labRow), las = 2, line = -0.5, tick = 0)
}
Using the breaks argument to image is another option. It allows more flexibility than zlim in setting the breakpoints for colors. Quoting from the help page, breaks is
a set of finite numeric breakpoints for the colours: must have one
more breakpoint than colour and be in increasing order. Unsorted
vectors will be sorted, with a warning.
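As a rough sketch of the breaks approach, the breakpoints can be computed once from the pooled data and reused for every plot. The choice of 20 colours and equally spaced breakpoints is an assumption made for illustration; unequal breakpoints would work the same way.

```r
set.seed(123)
d1 <- matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 <- matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat <- list(d1, d2)

n_col <- 20                      # number of colours (arbitrary choice)
rng <- range(unlist(mat))        # pooled range over all matrices
# one more breakpoint than colours, shared by every plot
brks <- seq(rng[1], rng[2], length.out = n_col + 1)

for (m in mat) {
  image(m, breaks = brks, col = heat.colors(n_col))
}
```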

Change loadings (arrows) length in PCA plot using ggplot2/ggfortify?

I have been struggling with rescaling the loadings (arrows) length in a ggplot2/ggfortify PCA. I have looked around extensively for an answer to this, and the only information I have found either codes new biplot functions or refers to entirely different packages for PCA (ggbiplot, factoextra), neither of which addresses the question I would like to answer:
Is it possible to scale/change size of PCA loadings in ggfortify?
Below is the code I have to plot a PCA using stock R functions as well as the code to plot a PCA using autoplot/ggfortify. You'll notice in the stock R plots I can scale the loads by simply multiplying by a scalar (*20 here) so my arrows aren't cramped in the middle of the PCA plot. Using autoplot...not so much. What am I missing? I'll move to another package if necessary but would really like to have a better understanding of ggfortify.
On other sites I have found, the graph axes limits never seem to exceed +/- 2. My graph goes +/- 20, and the loadings sit staunchly near 0, presumably at the same scale as graphs with smaller axes. I would still like to plot PCA using ggplot2, but if ggfortify won't do it then I need to find another package that will.
#load data geology rocks frame
georoc <- read.csv("http://people.ucsc.edu/~mclapham/earth125/data/georoc.csv")
#load libraries
library(ggplot2)
library(ggfortify)
geo.na <- na.omit(georoc) #remove NA values
geo_matrix <- as.matrix(geo.na[,3:29]) #create matrix of continuous data in data frame
pca.res <- prcomp(geo_matrix, scale = T) #perform PCA using correlation matrix (scale = T)
summary(pca.res) #return summary of PCA
#plotting in stock R
plot(pca.res$x, col = c("salmon","olivedrab","cadetblue3","purple")[geo.na$rock.type], pch = 16, cex = 0.2)
#make legend
legend("topleft", c("Andesite","Basalt","Dacite","Rhyolite"),
col = c("salmon","olivedrab","cadetblue3","purple"), pch = 16, bty = "n")
#add loadings and text
arrows(0, 0, pca.res$rotation[,1]*20, pca.res$rotation[,2]*20, length = 0.1)
text(pca.res$rotation[,1]*22, pca.res$rotation[,2]*22, rownames(pca.res$rotation), cex = 0.7)
#plotting PCA
autoplot(pca.res, data = geo.na, colour = "rock.type", #plot results, name using original data frame
loadings = T, loadings.colour = "black", loadings.label = T,
loadings.label.colour = "black")
The data comes from an online file from a class I'm taking, so you could just copy this if you have the ggplot2 and ggfortify packages installed. Graphs below.
R plot of what I want ggplot to look like
What ggplot actually looks like
Edit:
Adding reproducible code below.
library(dplyr) # provides %>% and select()
iris.res <-
iris %>%
select(Sepal.Length:Petal.Width) %>%
as.matrix(.) %>%
prcomp(., scale = F)
autoplot(iris.res, data = iris, size = 4, col = "Species", shape = "Species",
x = 1, y = 2, #components 1 and 2
loadings = T, loadings.colour = "grey50", loadings.label = T,
loadings.label.colour = "grey50", loadings.label.repel = T) + #loadings are arrows
geom_vline(xintercept = 0, lty = 2) +
geom_hline(yintercept = 0, lty = 2) +
theme_bw() + #complete theme first, otherwise it resets aspect.ratio
theme(aspect.ratio = 1)

This answer is probably long after the OP needs it, but I'm offering it because I have been wrestling with the same issue for a while, and maybe I can save someone else the same effort.
# Load data
iris <- data.frame(iris)
# Do PCA
PCA <- prcomp(iris[,1:4])
# Extract PC axes for plotting
PCAvalues <- data.frame(Species = iris$Species, PCA$x)
# Extract loadings of the variables
PCAloadings <- data.frame(Variables = rownames(PCA$rotation), PCA$rotation)
# Plot
ggplot(PCAvalues, aes(x = PC1, y = PC2, colour = Species)) +
geom_segment(data = PCAloadings, aes(x = 0, y = 0, xend = (PC1*5),
yend = (PC2*5)), arrow = arrow(length = unit(1/2, "picas")),
color = "black") +
geom_point(size = 3) +
annotate("text", x = (PCAloadings$PC1*5), y = (PCAloadings$PC2*5),
label = PCAloadings$Variables)
To increase the arrow length, multiply the loadings used for xend and yend in the geom_segment call. With a bit of trial and error, you can work out what number to use.
To place the labels at the arrow tips, multiply the loadings by the same value in the annotate call.
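If trial and error gets tedious, one possible alternative is to derive the multiplier from the data. The sketch below picks the largest factor that keeps every arrow tip within the range of the scores on PC1 and PC2. The name arrow_scale and the 0.8 shrink factor are my own assumptions, not part of any package.

```r
PCA <- prcomp(iris[, 1:4])
scores   <- PCA$x[, 1:2]         # sample coordinates on PC1 and PC2
loadings <- PCA$rotation[, 1:2]  # variable loadings on PC1 and PC2

# Largest multiplier that keeps every arrow tip inside the score range
# on both axes, shrunk by 20% to leave room for labels (0.8 is arbitrary)
ratio_pc1 <- max(abs(scores[, 1])) / max(abs(loadings[, 1]))
ratio_pc2 <- max(abs(scores[, 2])) / max(abs(loadings[, 2]))
arrow_scale <- 0.8 * min(ratio_pc1, ratio_pc2)

# Pre-scaled loadings, ready for geom_segment() and annotate()
PCAloadings <- data.frame(Variables = rownames(loadings),
                          loadings * arrow_scale)
PCAloadings
```

With the loadings scaled up front, xend = PC1 and yend = PC2 can be used directly in geom_segment, and the same columns give the label positions.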

dotplot dot not showing up and format of dot plot

How can I show the dots colored using the mosaic package to do a dotplot?
library(mosaic)
n=500
r =rnorm(n)
d = data.frame( x = sample(r ,n= 1,size = n, replace = TRUE), color = c(rep("red",n/2), rep("green",n/2)))
dotPlot(d$x,breaks = seq(min(d$x)-.1,max(d$x)+.1,.1))
Right now all the dots are blue, but I would like them to be colored according to the color column in the data table.

If you are still interested in a mosaic/lattice solution rather than a ggplot2 solution, here you go.
dotPlot( ~ x, data = d, width = 0.1, groups = color,
par.settings=list(superpose.symbol = list(pch = 16, col=c("green", "red"))))
resulting plot
Notice also:
As with ggplot2, the colors are not determined by the values in your color variable but by the theme. You can use par.settings to modify this on the level of a plot or trellis.par.set() to change the defaults.
It is preferable to use a formula and data = and to avoid the $ operator.
You can use the width argument rather than breaks if you want to set the bin width. (You can use the center argument to control the centers of the bins if that matters to you. By default, 0 will be the center of a bin.)

You need to add stackgroups=TRUE so that the two different colors aren't plotted on top of each other.
n=20
set.seed(15)
d = data.frame(x = sample(seq(1,10,1), n, replace = TRUE),
color = c(rep("red",n/2), rep("green",n/2)))
table(d$x[order(d$x)])
length(d$x[order(d$x)])
binwidth= 1
ggplot(d, aes(x = x)) +
geom_dotplot(breaks = seq(0.5,10.5,1), binwidth = binwidth,
method="histodot", aes(fill = color),
stackgroups=TRUE) +
scale_x_continuous(breaks=1:10)
Also, ggplot uses its internal color palette for the fill aesthetic. You'd get the same colors regardless of what you called the values of the "color" column in your data. Add scale_fill_manual(values=c("green","red")) if you want to set the colors manually.
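If the values in the color column are themselves valid colour names, as they are here, another option is scale_fill_identity(), which uses the column values directly as fill colours. This is only a sketch of that variant, with the data regenerated the same way as above.

```r
library(ggplot2)

set.seed(15)
n <- 20
d <- data.frame(x = sample(1:10, n, replace = TRUE),
                color = rep(c("red", "green"), each = n / 2))

p <- ggplot(d, aes(x = x, fill = color)) +
  geom_dotplot(binwidth = 1, method = "histodot", stackgroups = TRUE) +
  scale_fill_identity() +          # "red" is drawn red, "green" is drawn green
  scale_x_continuous(breaks = 1:10)
print(p)
```

Note that scale_fill_identity() draws no legend by default, so scale_fill_manual() may still be the better fit when a legend is needed.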

Place annotation at the top of a series of histograms in ggplot2 using a for loop

I am creating a number of histograms and I want to add annotations towards the top of the graph. I am plotting these using a for loop so I need a way to place the annotations at the top even though my ylims change from graph to graph. If I could store the ylim for each graph within the loop I could cause the y coordinates for my annotation to vary based on the current graph. The y value I include in my annotation must change dynamically as the loop proceeds across iterations. Here is some sample code to demonstrate my issue (Notice how the annotation moves around. I need it to change based on the ylim for each graph):
library(ggplot2)
cuts <- levels(as.factor(diamonds$cut))
pdf(file = "Annotation Example.pdf", width = 11, height = 8,
family = "Helvetica", bg = "white")
for (i in 1:length(cuts)) {
by.cut<-subset(diamonds, diamonds$cut == cuts[[i]])
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
annotate ("text", label = "My annotation goes at the top", x = 10000 ,hjust = 0, y = 220, color = "darkred"))
}
dev.off()

ggplot uses Inf in its positions to represent the extremes of the plot range, without changing the plot range. So the y value of the annotation can be set to Inf, and the vjust parameter can also be adjusted to get a better alignment.
...
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
annotate("text", label = "My annotation goes at the top",
x = 10000, hjust = 0, y = Inf, vjust = 2, color = "darkred"))
...
For i<-2, this looks as:

There may be a neater way, but you can get the max count and use that to set y in the annotate call:
for (i in 1:length(cuts)) {
by.cut<-subset(diamonds, diamonds$cut == cuts[[i]])
## approximate the bins ggplot will use: the default is 30 bins, which needs 31 break points
by.cut$cuts <- cut(by.cut$price, seq(min(by.cut$price), max(by.cut$price), length.out=31), include.lowest=TRUE)
## get the highest count of prices in a given cut.
y.max <- max(tapply(by.cut$price, by.cut$cuts, length))
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
## change y = 220 to y = y.max as defined above
annotate ("text", label = "My annotation goes at the top", x = 10000 ,hjust = 0, y = y.max, color = "darkred"))
}
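A possibly neater variant is to let ggplot compute the bins itself and read the tallest count back with ggplot_build(). The sketch below does this for a single cut ("Ideal" is just an example level); inside the loop the same three steps would apply to each by.cut.

```r
library(ggplot2)

by.cut <- subset(diamonds, cut == "Ideal")

# Build the histogram first, without the annotation
p <- ggplot(by.cut, aes(price)) +
  geom_histogram(fill = "steelblue", alpha = .55, bins = 30)

# ggplot_build() exposes the computed bin counts of the first layer
y.max <- max(ggplot_build(p)$data[[1]]$count)

# Place the annotation at the height of the tallest bar
p <- p + annotate("text", label = "My annotation goes at the top",
                  x = 10000, hjust = 0, y = y.max, color = "darkred")
print(p)
```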