I'd like to use something like ggplot2 and ggmap to produce a heat map of arbitrary values, such as property prices per square metre, over a geographic area at street level (with high resolution).
Unfortunately, the task appears to be rather difficult because while ggplot2 can produce a great density plot, it seems unable to visualise spatial data like this without prior interpolation.
For this, I've used the libraries akima (gridded bivariate interpolation for irregular data) and mgcv (generalised additive models with integrated smoothness estimation); however, my knowledge of interpolation methods is mediocre at best, and the results I've been able to produce aren't satisfactory.
Consider the following example:
Data
library(ggplot2)
library(ggmap)
library(tibble)  # tibble() is not attached by ggplot2 alone
## data simulation
set.seed(1945)
df <- tibble(x = rnorm(500, -0.7406, 0.03),
             y = rnorm(500, 51.9976, 0.03),
             z = abs(rnorm(500, 2000, 1000)))
Map, scatterplot, density plot
## ggmap
map <- get_map("Bletchley Park, Bletchley, Milton Keynes", zoom = 13, source = "stamen", maptype = "toner-background")
q <- ggmap(map, extent = "device", darken = .5)
## scatterplot over map
q + geom_point(aes(x, y, colour = z), data = df)
## classic density heat map
q +
  stat_density2d(aes(x = x, y = y, fill = ..level..), data = df, geom = "polygon", alpha = .2) +
  geom_density_2d(aes(x = x, y = y), data = df, colour = "white", alpha = .4) +
  scale_fill_distiller(palette = "Spectral")
As you can see, the data are rather dense over the chosen area and the density heat map looks great with round edges and closed curves (except for some of the outermost layers).
Interpolation and plotting using akima
## akima interpolation
library(akima)
df_akima <- interp2xyz(interp(x = df$x, y = df$y, z = df$z, duplicate = "mean", linear = TRUE,
                              xo = seq(min(df$x), max(df$x), length = 200),
                              yo = seq(min(df$y), max(df$y), length = 200)),
                       data.frame = TRUE)
## akima plot
q +
  geom_tile(aes(x = x, y = y, fill = z), data = df_akima, alpha = .4) +
  stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_akima, geom = 'polygon', alpha = .4) +
  geom_contour(aes(x = x, y = y, z = z), data = df_akima, colour = 'white', alpha = .4) +
  scale_fill_distiller(palette = "Spectral", na.value = NA)
This produces a dense grid of interpolated values (to ensure a sufficient resolution) and while the tile plot underneath is acceptable, the contour plots are too ragged and many of the curves aren't closed.
Spline interpolation (linear = FALSE) is smoother, but it apparently sacrifices resolution and goes wild with the numbers (producing negative values of z).
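One possible mitigation, sketched here as an assumption rather than something I've verified, is to run the spline interpolation and then drop the spurious negative values before plotting:
## spline (non-linear) interpolation, then discard negative z values
df_spline <- interp2xyz(interp(x = df$x, y = df$y, z = df$z, duplicate = "mean",
                               linear = FALSE,
                               xo = seq(min(df$x), max(df$x), length = 200),
                               yo = seq(min(df$y), max(df$y), length = 200)),
                        data.frame = TRUE)
df_spline$z[df_spline$z < 0] <- NA  # prices cannot be negative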
Interpolation and plotting using mgcv
## mgcv interpolation
library(mgcv)
fit <- gam(z ~ s(y, x, bs = 'sos'), data = df)  # bs = 'sos' expects latitude first, then longitude; also avoid naming the model `gam`, which masks mgcv::gam
df_mgcv <- expand.grid(x = seq(min(df$x), max(df$x), length = 200),
                       y = seq(min(df$y), max(df$y), length = 200))
df_mgcv$z <- predict(fit, df_mgcv, type = "response")
## mgcv plot
q +
  geom_tile(aes(x = x, y = y, fill = z), data = df_mgcv, alpha = .4) +
  stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_mgcv, geom = 'polygon', alpha = .4) +
  geom_contour(aes(x = x, y = y, z = z), data = df_mgcv, colour = 'white', alpha = .4) +
  scale_fill_distiller(palette = "Spectral", na.value = NA)
The same process using mgcv results in a nice, smooth plot, but the resolution is much lower and practically none of the curves are closed.
Questions
Could you please suggest a better method or modify my attempt to obtain a plot similar to the first one (clean, connected, and smooth lines with high resolution)?
Is it possible to close the curves, e.g. in the last plot (the shaded area should be computed beyond the image boundaries)?
Thank you for your time!
The problem with your maps is not the interpolation method you're using, but the way ggplot displays density lines. Here's an answer to this: Remove gaps in a stat_density2d ggplot chart without modifying XY limits.
The density lines go beyond the map, so any polygon that extends outside the plot area is rendered inappropriately (ggplot will close the polygon using the next point of the corresponding level). This does not show up much on your first map because the interpolation resolution is low.
The trick proposed by Andrew is to first expand the plot area so that the density lines are rendered correctly, then cut off the display area to hide the extra space. I tested his solution with your first example; here's the code:
q +
  stat_density2d(aes(x = x, y = y, fill = ..level..),
                 data = df,
                 geom = "polygon",
                 alpha = .2,
                 color = "white",
                 bins = 20) +
  scale_fill_distiller(palette = "Spectral") +
  xlim(min(df$x) - 10^-5,
       max(df$x) + 10^-5) +
  ylim(min(df$y) - 10^-3,
       max(df$y) + 10^-3) +
  coord_equal(expand = FALSE,
              xlim = c(-.778, -.688),
              ylim = c(51.965, 52.03))
The only differences are that I used min() - / max() + offsets instead of fixed numbers, and coord_equal to ensure the map isn't distorted. In addition, I manually specified a greater number of levels (using bins), since by increasing the plot area, stat_density2d automatically chooses a lower resolution.
As for the best interpolation method, this depends on your objective and the type of data you have. The question is not what is the best method for your map, but what is the best method for your data. This is a very broad issue, out of scope for this space. But here's a good guide: http://www.rspatial.org/analysis/rst/4-interpolation.html
For general ideas on how to make good maps in R using ggplot: http://spatial.ly/r/
Sorry, I can't run your example at the moment to provide details, but try autoKrige() from the automap package.
Kriging is a great method for interpolation. Just be sure that your data meets its requirements. Here's a good guide:
https://gisgeography.com/kriging-interpolation-prediction/
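For illustration, here's a minimal, untested sketch of how autoKrige() could be applied to the simulated data from the question (assuming the sp package for the spatial conversion; check the output's column names before mapping them in aes()):
library(sp)
library(automap)
## convert the simulated data to a SpatialPointsDataFrame, as automap expects
df_sp <- as.data.frame(df)
coordinates(df_sp) <- ~ x + y
## ordinary kriging of z; with new_data omitted, autoKrige fits a variogram
## automatically and predicts onto a grid derived from the input points
kr <- autoKrige(z ~ 1, df_sp)
## extract the predictions (var1.pred) for plotting with ggplot2/ggmap
df_kriged <- as.data.frame(kr$krige_output)
head(df_kriged)  # inspect coordinate column names before plotting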
Related
I have a boxplot which summarizes ~60000 turbidity data points into quartiles, median, whiskers and sometimes outliers. Often a few outliers are so high up that the whole plot is compressed at the bottom, so I choose to omit the outliers. However, I have also added averages to the plots as points, and I want these to be plotted always. The problem is that the y-axis of the boxplot does not adjust to the added average points, so when averages are far above the box they are simply plotted outside the chart window (see the X-point for 2020, but none for 2021 or 2022). Normally, with this parameter, the average will lie between the whisker end and the most extreme outliers; this is normal and expected in the data.
I have tried to capture the boxplot y-axis range to compare with the average, and then setting the ylim if needed, but I just don't know how to retrieve these axis ranges.
My code is just
boxplot(...)
points(...)
and works as far as plotting the points. Just not adjusting the y-axis.
Question 1: is it not possible to get the boxplot to redraw with the new points data? I thought this was standard in R plots.
Question 2: if not, how can I dynamically adjust the y-axis range?
Let's try to show a concrete example of the problem with some simulated data:
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
Here, group "B" has a single outlier at 150, even though most values are a couple of orders of magnitude lower. That means that if we try to draw a boxplot, the boxes get squished at the bottom of the plot:
boxplot(y ~ x, data = df, col = "lightblue")
If we remove outliers, the boxes plot nicely:
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
The problem comes when we want to add a point indicating the mean value for each boxplot, since the mean of "B" lies outside the plot limits. Let's calculate and plot the means:
mean_vals <- sapply(split(df$y, df$x), mean)
mean_vals
#> A B
#> 0.9840417 4.0703334
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
The mean for "B" is missing because it lies above the upper range of the plot.
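(As an aside, since the question asks how to retrieve the axis ranges: after any base-graphics plot has been drawn, they are available via par("usr"), which returns c(x1, x2, y1, y2). A quick sketch:)
## retrieve the y-axis range of the current plot (must be called after plotting)
usr <- par("usr")
usr[3:4]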
The secret here is to use boxplot.stats to get the limits of the whiskers. By concatenating our vector of means to this vector of stats and getting its range, we can set our plot limits exactly where they need to be:
y_limits <- range(c(boxplot.stats(df$y)$stats, mean_vals))
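(If the groups differ a lot, a per-group variant of the same idea may be safer; this is my own tweak, not part of the original approach:)
## per-group whisker limits, then the overall range including the means
y_limits <- range(c(sapply(split(df$y, df$x),
                           function(v) boxplot.stats(v)$stats),
                    mean_vals))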
Now we apply these limits to a new boxplot and draw it with the points:
boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
For comparison, you could do the whole thing in ggplot like this:
library(ggplot2)
ggplot(df, aes(x, y)) +
  geom_boxplot(fill = "lightblue", outlier.shape = NA) +
  geom_point(size = 3, color = "red", stat = "summary", fun = mean) +
  coord_cartesian(ylim = y_limits) +  # the same limits computed above
  theme_classic(base_size = 16)
Created on 2023-02-05 with reprex v2.0.2
I need to build a plot showing the difference in variance between the ordinary least squares estimator and the ridge estimator. For this I want to plot contour ellipses for the variance.
The data I'm working with is this:
set.seed(10)
sigma = 10
U = rnorm(50, 0, sigma)  # the errors
X = scale(matrix(rnorm(50 * 2), ncol = 2))
Y = scale(U + X %*% c(5, -2))
I then find the ridge (r) and OLS (m) estimators and their covariance matrices:
(EDIT: I've realised the covariance matrix for the ridge estimator is wrong, but that doesn't really matter here; it will just change the ellipses, not how to plot them.)
lbda = 100
r = solve(t(X) %*% X + lbda * diag(2)) %*% t(X) %*% Y
m = solve(t(X) %*% X) %*% t(X) %*% Y
covmatr = sigma^2 * solve(t(X) %*% X + lbda * diag(2))
covmatm = sigma^2 * solve(t(X) %*% X)
I then want to build the plot. In base R I would do something like this, using the ellipse function from the car package:
library(car)
library(latex2exp)  # needed for TeX()
plot(r[1], r[2], xlim = c(-5, 5), ylim = c(-5, 5), xlab = TeX("$\\beta_1$"), col = "red")
points(m[1], m[2], col = "blue")
abline(h = 0)
abline(v = 0)
ellipse(center = c(r), shape = covmatr, center.cex = 0, radius = sqrt(qchisq(0.5, 2)), col = "red")
ellipse(center = c(r), shape = covmatr, center.cex = 0, radius = sqrt(qchisq(0.9, 2)), col = "red")
ellipse(center = c(m), shape = covmatm, center.cex = 0, radius = sqrt(qchisq(0.5, 2)))
ellipse(center = c(m), shape = covmatm, center.cex = 0, radius = sqrt(qchisq(0.9, 2)))
I realize I forgot to rename the y-label
My problem is that I can't figure out how to do the same in ggplot2. I have tried:
#Building the plot with dots
rd = data.frame(r[1], r[2])
plot1 = ggplot(rd, aes(x = rd$r.1, y = rd$r.2, color = 'r')) +
  geom_point(size = 3) +
  geom_point(aes(x = m[1], y = m[2]), colour = "blue", size = 3) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  xlim(-5, 5) +
  ylim(-5, 5) +
  ggtitle(TeX("Variance for ridge")) +
  xlab(TeX("$\\beta_1$")) + ylab(TeX("$\\beta_2$")) +
  theme_bw()
plot1
#Trying to add ellipses:
ellipse(center = c(r), shape = covmatr, center.cex = 0, radius = sqrt(qchisq(0.5, 2)), col = "red")
ellipse(center = c(r), shape = covmatr, center.cex = 0, radius = sqrt(qchisq(0.9, 2)), col = "red")
ellipse(center = c(m), shape = covmatm, center.cex = 0, radius = sqrt(qchisq(0.5, 2)))
ellipse(center = c(m), shape = covmatm, center.cex = 0, radius = sqrt(qchisq(0.9, 2)))
However, this gives me the following error (translated from Danish):
Error in plot.xy(xy.coords(x, y), type = type, ...) :
  plot.new has not been called yet
I have also tried simply adding the ellipses within the ggplot call, again using the ellipse function from the car package:
plot1 = ... +
  ellipse(center = c(m), shape = covmatm, center.cex = 0, radius = sqrt(qchisq(0.9, 2)))
where "..." indicates the lines omitted above. I know both my attempts are quite naive, but I hoped one would work. I have found the command stat_ellipse, but it is not useful in my case because, as far as I can see, it cannot take a supplied covariance matrix such as that of the ridge estimator.
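One way to get such ellipses into ggplot2 (a sketch based on the standard covariance-ellipse parametrisation; the helper cov_ellipse below is my own illustration, not a car or ggplot2 function) is to compute the outline yourself and draw it with geom_path:
## points on the ellipse {x : (x - c)' S^{-1} (x - c) = radius^2},
## built from the Cholesky factor of the covariance matrix S
cov_ellipse <- function(center, shape, radius, n = 100) {
  theta <- seq(0, 2 * pi, length.out = n)
  unit_circle <- cbind(cos(theta), sin(theta))
  pts <- t(center + radius * t(unit_circle %*% chol(shape)))
  data.frame(x = pts[, 1], y = pts[, 2])
}
plot1 +
  geom_path(data = cov_ellipse(c(r), covmatr, sqrt(qchisq(0.5, 2))),
            aes(x = x, y = y), colour = "red", inherit.aes = FALSE) +
  geom_path(data = cov_ellipse(c(m), covmatm, sqrt(qchisq(0.5, 2))),
            aes(x = x, y = y), colour = "blue", inherit.aes = FALSE)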
I am trying to recreate in R an image found in a textbook, the original of which was built in MATLAB:
I have generated each of the graphs separately, but what would be best practice for combining them into an image like this in ggplot2?
Edit: I've added the code I used below. This is just a transformation of normally distributed data.
library(ggplot2)
mean <- 6
sd <- 1
X <- rnorm(100000, mean = mean, sd = sd)
Y <- dnorm(X, mean = mean, sd = sd)
Y_p <- pnorm(X, mean = mean, sd = sd)
ch_vars <- function(X){
  nu_vars <- c()
  for (x in X){
    nu_vars <- c(nu_vars, (1 / (1 + exp(-x + 5))))
  }
  return(nu_vars)
}
nu_X <- ch_vars(X)
nu_Y <- ch_vars(Y)
data <- data.frame(x = X, y = Y, y_p = Y_p, nu_x = nu_X, nu_y = nu_Y)  # unique column names
# Cumulative distribution
ggplot(data = data) +
  geom_line(aes(x = x, y = y_p))
# Distribution of initial data
ggplot(data = data, aes(x = x)) +
  geom_histogram(aes(y = ..density..), bins = 25, fill = "red", color = "black")
# Distribution of transformed data
ggplot(data = data, aes(x = nu_x)) +
  geom_histogram(aes(y = ..density..), bins = 25, fill = "green", color = "black")
In short, you can't, or rather, you shouldn't.
ggplot is a high-level plotting package. More than a system for drawing shapes and lines, it's fairly "opinionated" about how data should be represented, and one of its opinions is that a plot should express a clear relationship between its axes and marks (points, bars, lines, etc.). The axes essentially define a coordinate space, and the marks are then plotted onto that space in a straightforward and easily interpretable manner.
The plot you show breaks that relationship -- it's a set of essentially arbitrary histograms all drawn onto the same box, where the axis values become ambiguous. The x-axis represents the values of one histogram and the y-axis represents another (and thus neither axis represents the histograms' heights).
It is of course technically possible to force ggplot to render something like your example, but it would require pre-computing the histograms, normalizing their values and bin heights to a common coordinate space, converting these into suitable coordinates for use with geom_rect, and then re-labeling the plot axes. It would be a very large amount of manual effort and ultimately defeats the point of using a high-level plotting grammar like ggplot.
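For the record, here is a minimal sketch of the first step of that manual approach (one pre-computed histogram drawn with geom_rect, heights normalised to [0, 1]; the data frame is assumed to be the one from the question):
library(ggplot2)
## pre-compute one histogram; breaks = 25 is only a suggestion to hist()
h <- hist(data$x, breaks = 25, plot = FALSE)
bars <- data.frame(xmin = head(h$breaks, -1),
                   xmax = tail(h$breaks, -1),
                   ymin = 0,
                   ymax = h$density / max(h$density))  # normalise heights
## draw it as rectangles; a second, rotated histogram would swap the x and y
## coordinates, and the axes would then need manual relabelling
ggplot(bars) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "red", colour = "black")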
I am creating a number of heatmaps in R, but I am having problems when it comes to keeping the colour scale consistent across graphs.
I find that the colours are scaled within a graph. Is there a way to make colours consistent across graphs, i.e. so that the colour difference between a value of 0.4 and 0.5 is always the same?
Code Example:
set.seed(123)
d1 = matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 = matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat = list(d1, d2)
for(m in mat)
  heatmap(m, Rowv = NA, Colv = NA)
You'll note in the example that cell (2,3) in the first graph is coloured similarly to cell (1,3) in the second, despite the values differing by ~0.8.
Here's a way to do it with ggplot2, if you're open to not using base graphics:
library(reshape2)
library(ggplot2)
# Set common limits for color scale
limits = range(unlist(mat))
Here's the code for two separate graphs. The last line of code for each graph ensures that they use the same z limits for setting the colors:
ggplot(melt(mat[[1]]), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_continuous(limits = limits)

ggplot(melt(mat[[2]]), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_continuous(limits = limits)
Another option is to plot both heatmaps in a single graph using facetting, which automatically ensures both graphs are on the same color scale:
ggplot(melt(mat), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  facet_grid(. ~ L1)
I've used the default colors here, but for either approach you can set the color scale to be anything you wish. For example:
ggplot(melt(mat), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  facet_grid(. ~ L1) +
  scale_fill_gradient(low = "red", high = "green")
You could use the image function directly (heatmap uses image), though it will require some extra formatting to match the output of heatmap. You can use zlim to set the color range. Quoting from the ?image page:
the minimum and maximum z values for which colors should be plotted,
defaulting to the range of the finite values of z. Each of the given
colors will be used to color an equispaced interval of this range. The
midpoints of the intervals cover the range, so that values just
outside the range will be plotted.
# define zlim min and max for all the plots
minz = Reduce(min, mat)
maxz = Reduce(max, mat)
for(m in mat) {
  image(m, zlim = c(minz, maxz), col = heat.colors(20))
}
To get closer to the formatting produced by heatmap, you can just reuse some code from the heatmap function:
for(m in mat) {
  labCol = dim(m)[2]
  labRow = dim(m)[1]
  image(seq_len(labCol), seq_len(labRow), m, zlim = c(minz, maxz),
        col = heat.colors(20), axes = FALSE, xlab = "", ylab = "",
        xlim = 0.5 + c(0, labCol), ylim = 0.5 + c(0, labRow))
  axis(1, 1L:labCol, labels = seq_len(labCol), las = 2, line = -0.5, tick = 0)
  axis(4, 1L:labRow, labels = seq_len(labRow), las = 2, line = -0.5, tick = 0)
}
Using the breaks argument to image is another option. It allows more flexibility than zlim in setting the breakpoints for colors. Quoting from the help page, breaks is
a set of finite numeric breakpoints for the colours: must have one
more breakpoint than colour and be in increasing order. Unsorted
vectors will be sorted, with a warning.
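For example, here's a sketch using a shared set of breakpoints (one more break than colours, as the help page requires):
## 21 breakpoints shared by both plots, giving 20 equal-width colour bins
breaks <- seq(minz, maxz, length.out = 21)
for(m in mat) {
  image(m, breaks = breaks, col = heat.colors(20))
}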
I have been struggling with rescaling the length of the loadings (arrows) in a ggplot2/ggfortify PCA. I have looked around extensively for an answer to this, and the only information I have found either codes new biplot functions or refers to entirely different packages for PCA (ggbiplot, factoextra), neither of which addresses the question I would like to answer:
Is it possible to scale/change size of PCA loadings in ggfortify?
Below is the code I have to plot a PCA using stock R functions, as well as the code to plot a PCA using autoplot/ggfortify. You'll notice that in the stock R plots I can scale the loadings by simply multiplying by a scalar (*20 here) so my arrows aren't cramped in the middle of the PCA plot. Using autoplot...not so much. What am I missing? I'll move to another package if necessary, but would really like to have a better understanding of ggfortify.
On other sites I have found, the graph axes limits never seem to exceed +/- 2. My graph goes +/- 20, and the loadings sit staunchly near 0, presumably at the same scale as graphs with smaller axes. I would still like to plot PCA using ggplot2, but if ggfortify won't do it then I need to find another package that will.
#load geology rock data frame
georoc <- read.csv("http://people.ucsc.edu/~mclapham/earth125/data/georoc.csv")
#load libraries
library(ggplot2)
library(ggfortify)
geo.na <- na.omit(georoc) #remove NA values
geo_matrix <- as.matrix(geo.na[,3:29]) #create matrix of continuous data in data frame
pca.res <- prcomp(geo_matrix, scale = T) #perform PCA using correlation matrix (scale = T)
summary(pca.res) #return summary of PCA
#plotting in stock R
plot(pca.res$x, col = c("salmon","olivedrab","cadetblue3","purple")[geo.na$rock.type], pch = 16, cex = 0.2)
#make legend
legend("topleft", c("Andesite","Basalt","Dacite","Rhyolite"),
       col = c("salmon","olivedrab","cadetblue3","purple"), pch = 16, bty = "n")
#add loadings and text
arrows(0, 0, pca.res$rotation[,1]*20, pca.res$rotation[,2]*20, length = 0.1)
text(pca.res$rotation[,1]*22, pca.res$rotation[,2]*22, rownames(pca.res$rotation), cex = 0.7)
#plotting PCA
autoplot(pca.res, data = geo.na, colour = "rock.type", #plot results, name using original data frame
         loadings = T, loadings.colour = "black", loadings.label = T,
         loadings.label.colour = "black")
The data comes from an online file from a class I'm taking, so you could just copy this if you have the ggplot2 and ggfortify packages installed. Graphs below.
R plot of what I want ggplot to look like
What ggplot actually looks like
Edit:
Adding reproducible code below.
library(dplyr)  # for %>% and select()
iris.res <-
  iris %>%
  select(Sepal.Length:Petal.Width) %>%
  as.matrix(.) %>%
  prcomp(., scale = F)
autoplot(iris.res, data = iris, size = 4, colour = "Species", shape = "Species",
         x = 1, y = 2, #components 1 and 2
         loadings = T, loadings.colour = "grey50", loadings.label = T,
         loadings.label.colour = "grey50", loadings.label.repel = T) + #loadings are arrows
  geom_vline(xintercept = 0, lty = 2) +
  geom_hline(yintercept = 0, lty = 2) +
  theme_bw() +  # complete themes reset theme() tweaks, so apply theme_bw() first
  theme(aspect.ratio = 1)
This answer is probably long after the OP needs it, but I'm offering it because I have been wrestling with the same issue for a while, and maybe I can save someone else the same effort.
# Load libraries and data
library(ggplot2)
library(grid)  # for arrow() and unit()
iris <- data.frame(iris)
# Do PCA
PCA <- prcomp(iris[,1:4])
# Extract PC axes for plotting
PCAvalues <- data.frame(Species = iris$Species, PCA$x)
# Extract loadings of the variables
PCAloadings <- data.frame(Variables = rownames(PCA$rotation), PCA$rotation)
# Plot
ggplot(PCAvalues, aes(x = PC1, y = PC2, colour = Species)) +
  geom_segment(data = PCAloadings, aes(x = 0, y = 0, xend = (PC1*5),
               yend = (PC2*5)), arrow = arrow(length = unit(1/2, "picas")),
               color = "black") +
  geom_point(size = 3) +
  annotate("text", x = (PCAloadings$PC1*5), y = (PCAloadings$PC2*5),
           label = PCAloadings$Variables)
In order to increase the arrow length, multiply the loadings for the xend and yend in the geom_segment call. With a bit of trial and error, you can work out what number to use.
To place the labels in the correct place, multiply the PC axes by the same value in the annotate call.
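If you would rather avoid the trial and error, one option (my own addition, not part of the original answer) is to derive the multiplier from the data, for example so that the longest arrow reaches about 80% of the most distant point:
## derive a common multiplier instead of hard-coding 5
scale_factor <- 0.8 * max(abs(PCAvalues[, c("PC1", "PC2")])) /
  max(abs(PCAloadings[, c("PC1", "PC2")]))
## then use PC1 * scale_factor and PC2 * scale_factor in both the
## geom_segment and annotate calls above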