I am creating a number of heatmaps in R, but I am having problems when it comes to keeping the colour scale consistent across graphs.
I find that the colours are scaled within each graph. Is there a way to make the colours consistent across graphs, i.e. so that the colour difference between a value of 0.4 and a value of 0.5 is always the same?
Code Example:
set.seed(123)
d1 = matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 = matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat = list(d1, d2)
for(m in mat)
  heatmap(m, Rowv = NA, Colv = NA)
You'll note in the example that cell (2,3) in the first graph is coloured similarly to cell (1,3) in the second, despite the underlying values differing by roughly 0.8.
Here's a way to do it with ggplot2, if you're open to not using base graphics:
library(reshape2)
library(ggplot2)
# Set common limits for color scale
limits = range(unlist(mat))
Here's the code for two separate graphs. The last line of code for each graph ensures that they use the same z limits for setting the colors:
ggplot(melt(mat[[1]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
ggplot(melt(mat[[2]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
Another option is to plot both heatmaps in a single graph using faceting, which automatically ensures both panels share the same color scale:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1)
I've used the default colors here, but for either approach you can set the color scale to be anything you wish. For example:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1) +
scale_fill_gradient(low="red", high="green")
You could use the image function directly (heatmap uses image), though it will require some extra formatting to match the output of heatmap. You can use zlim to set the color range. Quoting from the ?image page:
the minimum and maximum z values for which colors should be plotted,
defaulting to the range of the finite values of z. Each of the given
colors will be used to color an equispaced interval of this range. The
midpoints of the intervals cover the range, so that values just
outside the range will be plotted.
# define zlim min and max for all the plots
minz = Reduce(min, mat)
maxz = Reduce(max, mat)
for(m in mat) {
  image(m, zlim = c(minz, maxz), col = heat.colors(20))
}
To get closer to the formatting produced by heatmap, you can just reuse some code from the heatmap function:
for(m in mat) {
  labCol = dim(m)[2]
  labRow = dim(m)[1]
  image(seq_len(labCol), seq_len(labRow), m, zlim = c(minz, maxz),
        col = heat.colors(20), axes = FALSE, xlab = "", ylab = "",
        xlim = 0.5 + c(0, labCol), ylim = 0.5 + c(0, labRow))
  axis(1, 1L:labCol, labels = seq_len(labCol), las = 2, line = -0.5, tick = 0)
  axis(4, 1L:labRow, labels = seq_len(labRow), las = 2, line = -0.5, tick = 0)
}
Using the breaks argument to image is another option. It allows more flexibility than zlim in setting the breakpoints for colors. Quoting from the help page, breaks is
a set of finite numeric breakpoints for the colours: must have one
more breakpoint than colour and be in increasing order. Unsorted
vectors will be sorted, with a warning.
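For instance, here's a minimal sketch reusing mat, minz, and maxz from above; the unevenly spaced breakpoints spend more of the colour palette on the low end of the range:
# one more breakpoint than colours, spanning the full data range
brks <- c(minz, 0.1, 0.2, 0.3, 0.5, maxz)
for(m in mat) {
  image(m, breaks = brks, col = heat.colors(length(brks) - 1))
}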
I have a boxplot which summarizes ~60000 turbidity data points into quartiles, median, whiskers and sometimes outliers. Often a few outliers are so high up that the whole plot is compressed at the bottom, so I choose to omit the outliers. However, I have also added averages to the plots as points, and I want these to be plotted always. The problem is that the y-axis of the boxplot does not adjust to the added average points, so when averages are far above the box they are simply plotted outside the chart window (see the X point for 2020, but none for 2021 or 2022). Normally with this parameter, the average will lie between the whisker end and the most extreme outliers. This is normal and expected in the data.
I have tried to capture the boxplot's y-axis range to compare with the averages, and then to set the ylim if needed, but I just don't know how to retrieve these axis ranges.
My code is just
boxplot(...)
points(...)
and works as far as plotting the points. Just not adjusting the y-axis.
Question 1: is it not possible to get the boxplot to redraw with the new points data? I thought this was standard in R plots.
Question 2: if not, how can I dynamically adjust the y-axis range?
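As a direct stab at question 2, here's a sketch with made-up data (not from the answer below): base graphics expose the region of the plot just drawn via par("usr"), which returns c(x1, x2, y1, y2), so you can test whether the means fit and redraw with an expanded ylim if they don't.
set.seed(1)
y <- rlnorm(100); grp <- rep(c("A", "B"), each = 50)
means <- tapply(y, grp, mean)
boxplot(y ~ grp, outline = FALSE)
usr <- par("usr")                    # axis ranges of the current plot
if (max(means) > usr[4]) {           # do the means poke above the top?
  boxplot(y ~ grp, outline = FALSE, ylim = c(usr[3], max(means)))
}
points(seq_along(means), means, pch = 16, col = "red")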
Let's try to show a concrete example of the problem with some simulated data:
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
Here, group "B" has a single outlier at 150, even though most values are a couple of orders of magnitude lower. That means that if we try to draw a boxplot, the boxes get squished at the bottom of the plot:
boxplot(y ~ x, data = df, col = "lightblue")
If we remove outliers, the boxes plot nicely:
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
The problem comes when we want to add a point indicating the mean value for each boxplot, since the mean of "B" lies outside the plot limits. Let's calculate and plot the means:
mean_vals <- sapply(split(df$y, df$x), mean)
mean_vals
#> A B
#> 0.9840417 4.0703334
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
The mean for "B" is missing because it lies above the upper range of the plot.
The secret here is to use boxplot.stats to get the limits of the whiskers. By concatenating our vector of means to this vector of stats and getting its range, we can set our plot limits exactly where they need to be:
y_limits <- range(c(boxplot.stats(df$y)$stats, mean_vals))
Now we apply these limits to a new boxplot and draw it with the points:
boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
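As a side note (not in the original answer): boxplot.stats(df$y) above pools both groups when computing the whiskers. If your groups sit on very different scales, a per-group sketch keeps the limits tight:
# whisker stats computed within each group, then ranged with the means
group_stats <- sapply(split(df$y, df$x), function(v) boxplot.stats(v)$stats)
y_limits <- range(c(group_stats, mean_vals))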
For comparison, you could do the whole thing in ggplot like this:
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_boxplot(fill = "lightblue", outlier.shape = NA) +
geom_point(size = 3, color = "red", stat = "summary", fun = mean) +
coord_cartesian(ylim = range(c(boxplot.stats(df$y)$stats, mean_vals))) +
theme_classic(base_size = 16)
I have been struggling with rescaling the length of the loadings (arrows) in a ggplot2/ggfortify PCA. I have looked around extensively for an answer, and the only information I have found either codes new biplot functions or refers to entirely different PCA packages (ggbiplot, factoextra), neither of which addresses the question I would like to answer:
Is it possible to scale/change size of PCA loadings in ggfortify?
Below is the code I have to plot a PCA using stock R functions, as well as the code to plot a PCA using autoplot/ggfortify. You'll notice in the stock R plots I can scale the loadings by simply multiplying by a scalar (*20 here) so my arrows aren't cramped in the middle of the PCA plot. Using autoplot... not so much. What am I missing? I'll move to another package if necessary, but would really like to have a better understanding of ggfortify.
On other sites I have found, the graph axis limits never seem to exceed +/- 2. My graph goes to +/- 20, and the loadings sit staunchly near 0, presumably at the same scale as the graphs with smaller axes. I would still like to plot the PCA using ggplot2, but if ggfortify won't do it then I need to find another package that will.
#load geology rock data frame
georoc <- read.csv("http://people.ucsc.edu/~mclapham/earth125/data/georoc.csv")
#load libraries
library(ggplot2)
library(ggfortify)
geo.na <- na.omit(georoc) #remove NA values
geo_matrix <- as.matrix(geo.na[,3:29]) #create matrix of continuous data in data frame
pca.res <- prcomp(geo_matrix, scale = T) #perform PCA using correlation matrix (scale = T)
summary(pca.res) #return summary of PCA
#plotting in stock R
plot(pca.res$x, col = c("salmon","olivedrab","cadetblue3","purple")[geo.na$rock.type], pch = 16, cex = 0.2)
#make legend
legend("topleft", c("Andesite","Basalt","Dacite","Rhyolite"),
col = c("salmon","olivedrab","cadetblue3","purple"), pch = 16, bty = "n")
#add loadings and text
arrows(0, 0, pca.res$rotation[,1]*20, pca.res$rotation[,2]*20, length = 0.1)
text(pca.res$rotation[,1]*22, pca.res$rotation[,2]*22, rownames(pca.res$rotation), cex = 0.7)
#plotting PCA
autoplot(pca.res, data = geo.na, colour = "rock.type", #plot results, name using original data frame
loadings = T, loadings.colour = "black", loadings.label = T,
loadings.label.colour = "black")
The data comes from an online file from a class I'm taking, so you could just copy this if you have the ggplot2 and ggfortify packages installed. Graphs below.
R plot of what I want ggplot to look like
What ggplot actually looks like
Edit:
Adding reproducible code below.
library(dplyr) # for %>% and select()
iris.res <-
iris %>%
select(Sepal.Length:Petal.Width) %>%
as.matrix(.) %>%
prcomp(., scale = F)
autoplot(iris.res, data = iris, size = 4, colour = "Species", shape = "Species",
x = 1, y = 2, #components 1 and 2
loadings = T, loadings.colour = "grey50", loadings.label = T,
loadings.label.colour = "grey50", loadings.label.repel = T) + #loadings are arrows
geom_vline(xintercept = 0, lty = 2) +
geom_hline(yintercept = 0, lty = 2) +
theme(aspect.ratio = 1) +
theme_bw()
This answer is probably long after the OP needs it, but I'm offering it because I have been wrestling with the same issue for a while, and maybe I can save someone else the same effort.
library(ggplot2)
# Load the built-in iris data
iris <- data.frame(iris)
# Do PCA
PCA <- prcomp(iris[,1:4])
# Extract PC axes for plotting
PCAvalues <- data.frame(Species = iris$Species, PCA$x)
# Extract loadings of the variables
PCAloadings <- data.frame(Variables = rownames(PCA$rotation), PCA$rotation)
# Plot
ggplot(PCAvalues, aes(x = PC1, y = PC2, colour = Species)) +
geom_segment(data = PCAloadings, aes(x = 0, y = 0, xend = (PC1*5),
yend = (PC2*5)), arrow = arrow(length = unit(1/2, "picas")),
color = "black") +
geom_point(size = 3) +
annotate("text", x = (PCAloadings$PC1*5), y = (PCAloadings$PC2*5),
label = PCAloadings$Variables)
In order to increase the arrow length, multiply the loadings for the xend and yend in the geom_segment call. With a bit of trial and error, you can work out what number to use.
To position the labels correctly, multiply the PC axes by the same value in the annotate call.
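If you'd rather not rely on trial and error, here's a small heuristic (my own sketch, not part of the original answer) that picks the multiplier so the longest arrow reaches about 80% of the plotted score range:
# Heuristic scale factor: longest arrow ~80% of the score range per axis
scaler <- 0.8 * min(max(abs(PCAvalues$PC1)) / max(abs(PCAloadings$PC1)),
                    max(abs(PCAvalues$PC2)) / max(abs(PCAloadings$PC2)))
Then substitute PC1 * scaler and PC2 * scaler for the hard-coded PC1*5 and PC2*5 in the geom_segment and annotate calls.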
I'm trying to create a figure similar to the one below (taken from Ro, Russell, & Lavie, 2001). In their graph, they are plotting bars for the errors (i.e., accuracy) within the reaction time bars. Basically, what I am looking for is a way to plot bars within bars.
I know there are several challenges with creating a graph like this. First, Hadley points out that it is not possible to create a graph with two scales in ggplot2 because those graphs are fundamentally flawed (see Plot with 2 y axes, one y axis on the left, and another y axis on the right)
Nonetheless, the graph with superimposed bars seems to solve this dual scaling problem, and I'm trying to figure out a way to create it in R. Any help would be appreciated.
It's fairly easy in base R, by using par(new = T) to add to an existing graph:
set.seed(54321) # for reproducibility
data.1 <- sample(1000:2000, 10)
data.2 <- sample(seq(0, 5, 0.1), 10)
# Use xpd = F to avoid plotting the bars below the axis
barplot(data.1, las = 1, col = "black", ylim = c(500, 3000), xpd = F)
par(new = T)
# Plot the new data with a different ylim, but don't plot the axis
barplot(data.2, las = 1, col = "white", ylim = c(0, 30), yaxt = "n")
# Add the axis on the right
axis(4, las = 1)
It is pretty easy to make the bars in ggplot. Here is some example code. No two y-axes though (although look here for a way to do that too, or see the sec_axis sketch below).
library(ggplot2)
data.1 <- sample(1000:2000, 10)
data.2 <- sample(500:1000, 10)
ggplot(mapping = aes(x, y)) +
geom_bar(data = data.frame(x = 1:10, y = data.1), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = 1:10, y = data.2), width = 0.4, stat = 'identity', fill = 'white') +
theme_classic() + scale_y_continuous(expand = c(0, 0))
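If you do want the look of a second axis in ggplot, here's a sketch using sec_axis (my addition, not part of the original answer): rescale data.2 into data.1's units and label a secondary axis with the inverse transform.
k <- max(data.1) / max(data.2)  # rescaling factor between the two series
ggplot(mapping = aes(x, y)) +
geom_bar(data = data.frame(x = 1:10, y = data.1), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = 1:10, y = data.2 * k), width = 0.4, stat = 'identity', fill = 'white') +
scale_y_continuous(expand = c(0, 0), sec.axis = sec_axis(~ . / k, name = "data.2")) +
theme_classic()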
In R, with ecdf I can plot an empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How can I plot the histogram and the ecdf in the same plot?
EDIT
I'm trying to make something like this:
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends @Christoph's solution with a second y-axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: you don't add a line to your plot, you draw a second plot on top of the first, which is why we need par(new = T). Then you have to add the y-axis afterwards (otherwise it would be drawn over the y-axis on the left).
Credits go to @tim_yates's answer and the other linked solutions.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram; this results in a harder-to-read histogram. The second is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly @hadley's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
geom_histogram(aes(y = after_stat(count) / sum(after_stat(count)))) +
stat_ecdf(color = "red")
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')
You can try a ggplot approach with a second axis:
library(dplyr)   # for tibble(), bind_cols(), mutate() and %>%
library(ggplot2)
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong, I added a second solution, calculating everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
to get
Although a bit late... here is another version that works with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))] # x value where the ecdf first reaches ~90%
lines(x = c(x90, x90), y = c(0, max(h$counts)), col = 'black', lty = 3)
(Only the second y-axis is not working yet...)
In addition to previous answers, I wanted to have ggplot do the tedious calculation (in contrast to @Roman's solution, which was kindly updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (shown here as a runnable sketch on the mydata example from above):
# 1. Prepare the plot
p <- ggplot(mydata, aes(x)) + geom_histogram()
# 2. Get the max value of the y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(p)$data[[1]]$y)
# 3. Overlay the scaled ECDF and add a secondary axis
p +
stat_ecdf(aes(y = after_stat(ecdf) * maxPlotY)) +
scale_y_continuous(name = "Count", sec.axis = sec_axis(trans = ~ . / maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggplot. Just sit back and let it do everything for you!
I am creating a number of histograms and I want to add annotations towards the top of each graph. I am plotting these in a for loop, so I need a way to place the annotations at the top even though my ylims change from graph to graph. If I could store the ylim for each graph within the loop, I could make the y coordinate of my annotation vary with the current graph. The y value in my annotation must change dynamically as the loop proceeds across iterations. Here is some sample code to demonstrate my issue (notice how the annotation moves around; I need it to change based on the ylim for each graph):
library(ggplot2)
cuts <- levels(as.factor(diamonds$cut))
pdf(file = "Annotation Example.pdf", width = 11, height = 8,
family = "Helvetica", bg = "white")
for (i in 1:length(cuts)) {
by.cut <- subset(diamonds, diamonds$cut == cuts[[i]])
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
annotate("text", label = "My annotation goes at the top", x = 10000, hjust = 0, y = 220, color = "darkred"))
}
dev.off()
ggplot uses Inf in its positions to represent the extremes of the plot range, without changing the plot range. So the y value of the annotation can be set to Inf, and the vjust parameter can also be adjusted to get a better alignment.
...
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
annotate("text", label = "My annotation goes at the top",
x = 10000, hjust = 0, y = Inf, vjust = 2, color = "darkred"))
...
For i = 2, the result looks like this:
There may be a neater way, but you can get the max count and use that to set y in the annotate call:
for (i in 1:length(cuts)) {
by.cut <- subset(diamonds, diamonds$cut == cuts[[i]])
## approximate the cut points ggplot will use (it defaults to 30 bins, i.e. 31 breakpoints)
by.cut$cuts <- cut(by.cut$price, seq(min(by.cut$price), max(by.cut$price), length.out = 31))
## get the highest count of prices in a given cut.
y.max <- max(tapply(by.cut$price, by.cut$cuts, length))
print(ggplot(by.cut, aes(price)) +
geom_histogram(fill = "steelblue", alpha = .55) +
## change y = 220 to y = y.max as defined above
annotate ("text", label = "My annotation goes at the top", x = 10000 ,hjust = 0, y = y.max, color = "darkred"))
}
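A somewhat neater variant (a sketch of my own, borrowing the ggplot_build trick from the ECDF answer above): let ggplot bin the data itself and read the tallest count back from the built plot, instead of re-deriving the cuts by hand.
for (i in 1:length(cuts)) {
  by.cut <- subset(diamonds, diamonds$cut == cuts[[i]])
  p <- ggplot(by.cut, aes(price)) +
    geom_histogram(fill = "steelblue", alpha = .55)
  ## height of the tallest bin, as ggplot computed it
  y.max <- max(ggplot_build(p)$data[[1]]$count)
  print(p + annotate("text", label = "My annotation goes at the top",
                     x = 10000, hjust = 0, y = y.max, color = "darkred"))
}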