Add straight line and text on a hexbinplot - r

I want to add a standard 1:1 line (with intercept 0 and slope 1) and some text (for example R-square) on my hexbinplot. The trial codes could be something like
require(hexbin)
x1 <- rnorm(10000)
x2 <- rnorm(10000)
df <- data.frame(cbind(x1, x2))
hex <- hexbin(x1, x2, xbins = 300)
hexbinplot(x2 ~ x1, data = df, aspect = '1', xbins = 300, xlim = c(-5, 5), ylim = c(-5, 5))
hexVP.abline(hexViewport(hex), 0, 1)
This gives me the plot below
The line added have two issues which are
It should be from the lower left corner to the top right corner,but the slope (should be 1) and intercept (should be 0) looks change when I zoom in/out the graphic window in RStudio and
the both ends of the line does not extend to the plot border
Another question is about how to add a text on the plot.
The ideal plot could be looks like

Cute that a package still uses lattice for graphics. That's really retro!
Here is the lattice way:
hexbinplot(x2 ~ x1, data = df, aspect = '1', xbins = 300, xlim = c(-5, 5), ylim = c(-5, 5),
panel = function(x, y, ...) {
panel.hexbinplot(x, y, ...)
lattice::panel.abline(a = 0, b = 1)
})
(Edit: After you added additional requirements: Use panel.text to add text to a lattice plot.)
Personally, I'd use ggplot2 and its geom_hex:
library(ggplot2)
ggplot(df, aes(x = x1, y = x2)) +
geom_hex(bins = 300) +
xlim(-5, 5) + ylim(-5, 5) +
geom_abline(intercept = 0, slope = 1)

Related

Is there a way to use R to break chart axis and break linear regression line?

I'm trying to figure out how to modify a scatter-plot that contains two groups of data along a continuum separated by a large gap. The graph needs a break on the x-axis as well as on the regression line.
This R code using the ggplot2 library accurately presents the data, but is unsightly due to the vast amount of empty space on the graph. Pearson's correlation is -0.1380438.
library(ggplot2)
p <- ggplot(, aes(x = dis, y = result[, 1])) + geom_point(shape = 1) +
xlab("X-axis") +
ylab("Y-axis") + geom_smooth(color = "red", method = "lm", se = F) + theme_classic()
p + theme(plot.title = element_text(hjust = 0.5, size = 14))
This R code uses gap.plot to produce the breaks needed, but the regression line doesn't contain a break and doesn't reflect the slope properly. As you can see, the slope of the regression line isn't as sharp as the graph above and there needs to be a visible distinction in the slope of the line between those disparate groups.
library(plotrix)
gap.plot(
x = dis,
y = result[, 1],
gap = c(700, 4700),
gap.axis = "x",
xlab = "X-Axis",
ylab = "Y-Axis",
xtics = seq(0, 5575, by = 200)
)
abline(v = seq(700, 733) , col = "white")
abline(lm(result[, 1] ~ dis), col = "red", lwd = 2)
axis.break(1, 716, style = "slash")
Using MS Paint, I created an approximation of what the graph should look like. Notice the break marks on the top as well as the discontinuity between on the regression line between the two groups.
One solution is to plot the regression line in two pieces, using ablineclip to limit what's plotted each time. (Similar to #tung's suggestion, although it's clear that you want the appearance of a single graph rather than the appearance of facets.) Here's how that would work:
library(plotrix)
# Simulate some data that looks roughly like the original graph.
dis = c(rnorm(100, 300, 50), rnorm(100, 5000, 100))
result = c(rnorm(100, 0.6, 0.1), rnorm(100, 0.5, 0.1))
# Store the location of the gap so we can refer to it later.
x.axis.gap = c(700, 4700)
# gap.plot() works internally by shifting the location of the points to be
# plotted based on the gap size/location, and then adjusting the axis labels
# accordingly. We'll re-compute the second half of the regression line in the
# same way; these are the new values for the x-axis.
dis.alt = dis - x.axis.gap[1]
# Plot (same as before).
gap.plot(
x = dis,
y = result,
gap = x.axis.gap,
gap.axis = "x",
xlab = "X-Axis",
ylab = "Y-Axis",
xtics = seq(0, 5575, by = 200)
)
abline(v = seq(700, 733), col = "white")
axis.break(1, 716, style = "slash")
# Add regression line in two pieces: from 0 to the start of the gap, and from
# the end of the gap to infinity.
ablineclip(lm(result ~ dis), col = "red", lwd = 2, x2 = x.axis.gap[1])
ablineclip(lm(result ~ dis.alt), col = "red", lwd = 2, x1 = x.axis.gap[1] + 33)

How to add to ggplot2 plot inside of for loop

I'm trying to plot multiple circles of different sizes on a plot using ggplot2's geom_point inside of a for loop. Every time I run it though, it plots all the circles, but all in the location of the last circle instead of in their respective locations as given by the data frame. Below is an example of the code I am running. I'm wondering how I would fix this or if there's a better way to get at what I'm trying to do here.
data <- data.frame("x" = c(0, 500, 1000, 1500, 2000),
"y" = c(1500, 500, 2000, 0, 1000),
"size" = c(3, 5, 1.5, 4.2, 2.6)
)
g <- ggplot(data = data, aes(x = x, y = y)) + xlim(0,2000) + ylim(0,2000)
for(i in 1:5) {
g <- g + geom_point(aes(x=data$x[i],y=data$y[i]), size = data$size[i], pch = 1)
}
print(g)
It's pretty rare to need a for-loop for a plot -- ggplot2 will take the whole dataframe and process it all without you needing to manage each row.
ggplot(data = data, aes(x = x, y = y, size = size)) +
geom_point(pch = 1)

r : ecdf over histogram

in R, with ecdf I can plot a empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How I can plot the histogram and the ecdf in the same plot?
EDIT
I try make something like that
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends #Christoph 's Solution with a second y-Axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: You don't add a line to your plot, but plot another plot on top, that's why we need par(new = T). Then you have to add the y-axis later on (otherwise it will be plotted over the y-axis on the left).
Credits go here (#tim_yates Answer) and there.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly #hadley 's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_bar(aes(y = (..count..)/sum(..count..)))
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')
you can try a ggplot approach with a second axis
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong I added a second solution, calculatin everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
to get
Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
lines(x = c(which.min(abs(ec(h$mids) - 0.9)), which.min(abs(ec(h$mids) - 0.9))), # indicates where 90% is reached
y = c(0, max(h$counts)), col ='black', lty = 3)
(Only the second y-axis is not working yet...)
In addition to previous answers, I wanted to have ggplot do the tedious calculation (in contrast to #Roman's solution, which was kindly enough updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (pseudo code):
# 1. Prepare the plot
plot <- ggplot() + geom_hist(...)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggpplot. Just lay back and let it do everything for you!

zoom in a CDF figure

I have the below figure shows cdf. I am wondering how can I zoom in to show better the difference between four lines in the left upper part of the figure.
You can use coord_cartesian to zoom in. I don't know what you mean by having the zoom part and the whole in the same figure. If you want to have them side by side, you can use the multiplot function found in the Cookbook for R page. For example:
df <- data.frame(x = c(rnorm(100, 0, 3), rnorm(100, 0, 10)),
g = gl(2, 100))
p <- ggplot(df, aes(x, colour = g)) + stat_ecdf()
p1 <- p
p2 <- p + coord_cartesian(ylim = c(.75, 1))
multiplot(p1, p2)
Edit
Based on #Paul Lemmens' comment, you can use grid's viewport function in the following way:
pdf("~/Desktop/foo.pdf", width = 6, height = 6)
subvp <- viewport(width = .4, height = .4, x = .75, y = .25)
p1
print(p2, vp = subvp)
dev.off()
which gives the following output - adjust the details for your specific example:

How to draw the curves in an energy diagram in R?

I wrote following R script:
#energy diagram
x <- c(0.1, 0.3, 0.5, 0.7, 0.9 ) #chosen randomly, reaction axis
y <- c(-5.057920, -5.057859, -5.057887,-5.057674, -5.057919 ) #energy of the educt, intermediate, transtition states and product
plot(x,y, type="p",
xlim=c(0,1),
ylim=c(-5.058,-5.0575),
xlab="reaction axis",
ylab=expression(paste(E[el] ," / ",10^6," ",kJ/mol)),
xaxt="n" #hide x-axis
)
#h- and v-lines, so i can draw curves by hand
abline(v=seq(0,1,0.1),h=seq(-5.0600,-5.0500,0.00005),col="black",lty=1,lwd=1)
abline(h=c(-5.057920, -5.057859, -5.057887,-5.057674), col="blue", lty=1,lwd=0.7)
Is it possible to draw a curve through the points that would look like a energy diagram. An example of an energy diagram is here:
A lot could be done to streamline / vectorize this code, but for a smallish diagram this works pretty well:
# get that data
x <- c(0.1, 0.3, 0.5, 0.7, 0.9 ) # reaction axis
y <- c(-5.057920, -5.057859, -5.057887,-5.057674, -5.057919 ) # energies
I'm going to make a little Bezier curve to connect each point to the next---this way we can make sure the smooth line passes through the data, not just close to it. I'll give each point a single 'control point' to define the slope. By using the same y-values for a point and it's control point, the slope at the point will be 0. I'll call the offset between the point and the control point delta. We'll start with one point-pair:
library(Hmisc)
delta = 0.15
bezx = c(0.1, 0.1 + delta, 0.3 - delta, 0.3)
bezy = rep(y[1:2], each = 2)
plot(bezx, bezy, type = 'b', col = "gray80")
lines(bezier(bezx, bezy), lwd = 2, col = "firebrick4")
Here I plotted the points and control points in gray, and the smooth line in red so we can see what's going on.
It looks promising, let's turn it into a function that we can apply to each pair of points:
bezf = function(x1, x2, y1, y2, delta = 0.15) {
bezier(x = c(x1, x1 + delta, x2 - delta, x2), y = c(y1, y1, y2, y2))
}
You can play with the delta parameter, I think 0.1 looks pretty good.
plot(x, y, xlab = "Reaction coordinate", ylab = "E", axes = F)
box(bty = "L")
axis(side = 2)
for(i in 1:(length(x) - 1)) {
lines(bezf(x1 = x[i], x2 = x[i + 1], y1 = y[i], y2 = y[i + 1], delta = 0.1))
}
You can of course tweak the plot, add labels, and ablines as in your original. (Use my for loop with the lines command to draw only the smoothed lines.) I left the points on to show that we are passing through them, not just getting close.
I prefer plotting in ggplot2, if you do too you'll need to extract the data into a data.frame:
bezlist = list()
for (i in 1:(length(x) - 1)) {
bezlist[[i]] = bezf(x1 = x[i], x2 = x[i + 1], y1 = y[i], y2 = y[i + 1], delta = 0.1)
}
xx = unlist(lapply(bezlist, FUN = '[', 'y'))
yy = unlist(lapply(bezlist, FUN = '[', 'y'))
bezdat = data.frame(react = xx, E = yy)
library(ggplot2)
ggplot(bezdat, aes(x = react, y = E)) +
geom_line() +
labs(x = "Reaction coordinate")
You could use a spline fit. Define some points along the energy diagram, and then fit to them using a spline function. The more points that you provide, the better that your fit will be. You can check out the smooth.splines function in the stats package for one implementation of the spline fit.

Resources