Plotting in R using stat_function on a logarithmic scale

I'm having serious problems trying to get my head around stat_function in R's ggplot2. I started off with this trivial example:
ggplot(data.frame(x = c(1, 1e4)), aes(x)) + stat_function(fun = function(x) x)
which works as expected. Unfortunately, when I add log scales for both x and y axes so:
ggplot(data.frame(x = 1:1e4), aes(x)) +
scale_x_log10() +
scale_y_log10() +
stat_function(fun = function(x) x)
I get the following result, which is a pretty nasty violation of the identity function.
Is there something very basic that I'm missing? What is then the correct and least hacky way to plot a function on log scale?
EDIT:
Inspired by the answers I went on and experimented with scales and the aesthetics parameter. I was even more puzzled to find out that I got what I expected using the code below:
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
scale_x_log10() +
scale_y_log10() +
stat_function(fun = function(x) x)
with an apparently unused vector of y values (unused by stat_function that is). Do the axis transformations depend on the availability of data?

When you use scale_x_log10(), the x values are log-transformed first, and only then used by stat_function() to calculate the y values. Afterwards the x values are back-transformed to their original values to label the axis, but the y values remain as calculated from the log-transformed x. You can check this by plotting without scale_y_log10(): the plot shows a straight line.
ggplot(data.frame(x=1:1e4), aes(x)) +
stat_function(fun = function(x) x) +
scale_x_log10()
If you then apply scale_y_log10(), you log-transform the already calculated y values, so a curve is plotted instead of a straight line.
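One way to sidestep the question of when stat_function() evaluates the function is to compute the curve yourself and let the scales transform finished data. This is only a minimal sketch, assuming ggplot2 is installed; the names `f` and `curve_df` are illustrative and not from the original post:

```r
# Function to plot; the identity here, as in the question.
f <- function(x) x

# Evaluate f in original units on a grid that is evenly spaced in log10(x).
curve_df <- data.frame(x = 10^seq(0, 4, length.out = 101))
curve_df$y <- f(curve_df$x)

# Both scales now transform already-computed data, so the identity
# stays a straight line on the log-log plot.
# Plotting is optional and needs ggplot2.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(curve_df, aes(x, y)) +
    geom_line() +
    scale_x_log10() +
    scale_y_log10()
}
```

Printing `p` draws the plot. Because the y values exist before any scale is applied, the result should not depend on how a particular ggplot2 version orders the transformations inside stat_function().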

In ggplot2, the rule is that scale transformation precedes statistical transformation which in turn precedes coordinate transformation. In this context, the function (via stat_function()) is the statistical transformation.
If you use a scale_x/y_*() function in a ggplot2 call, it will apply the scale transformation(s) first before computing the function.
Case 0: Plot in the original scales of x and y.
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
stat_function(fun = function(x) x)
Case 1a: Both x and y are log transformed before the function is computed because of the presence of scale_x/y_log10(). You can see this from the values on their respective scales (compare to Case 0).
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
stat_function(fun = function(x) x) +
scale_x_log10() +
scale_y_log10()
Case 1b: x is log transformed in the original data frame. Consequently, the function actually operates on the log10(x) values, so will still be a straight line, but on the log10 scale in both x and y.
ggplot(data.frame(x = log10(seq(1e4)), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x)
Case 1c: The same as 1b, with one exception: the x-axis is labelled in the original units while the y-axis is in log10(x) units, because the scale transformation on x occurs before the statistical transformation f(x) = x is evaluated, so the function actually receives log10(x).
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
scale_x_log10()
Case 2: By contrast, coordinate transformations take place after statistical transformation; i.e., the function is computed in the original units first and then the coordinate transformation on x takes place, which warps the function:
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
coord_trans(x = "log10")
...unless, of course, you apply the same transformation to both x and y:
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
coord_trans(x = "log10", y = "log10")
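The ordering described in these cases can be mimicked with plain arithmetic. The following is only an illustration of the description above, not ggplot2's internal code; `f`, `x` and the other variable names are assumptions made for the example, and a non-linear `f` is used so the difference is visible:

```r
f <- function(x) x^2          # any non-linear function shows the effect
x <- c(1, 10, 100, 1000)

# "Scale first" (Case 1c as described): x is log10-transformed,
# then the function is evaluated on the transformed values.
y_scale_first <- f(log10(x))

# "Coordinate last" (Case 2 as described): the function is evaluated
# in original units and only the drawing position of x is transformed.
y_coord_last <- f(x)
x_position   <- log10(x)

data.frame(x, x_position, y_scale_first, y_coord_last)
```

The two y columns differ even though the x positions on the page are identical, which is the distinction between a scale transformation and a coordinate transformation drawn in the cases above.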

Related

Use of inverse parameter in trans_new scales package

I've been trying to use the trans_new function from the scales package, but I can't get it to display labels correctly.
# percent to fold change
fun1 <- function(x) (x/100) + 1
# fold change to percent
inv_fun1 <- function(x) (x - 1) * 100
percent_to_fold_change_trans <- trans_new(name = "transform", transform = fun1, inverse = inv_fun1)
plot_data <- data.frame(x = 1:10,
y = inv_fun1(1:10))
# Plot raw data
p1 <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point()
# This doesn't really change the plot
p2 <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
coord_trans(y = percent_to_fold_change_trans)
p1 and p2 are identical, whereas I'm expecting p2 to be a diagonal line since we are reversing the inverting function. If I replace the inverse parameter in trans_new with another function (like function(x) x), I can see the correct transformation but the labels are completely off. Any ideas on how to define the inverse parameter to get the right label positions?

You wouldn't expect a linear function like fun1 to change the appearance of the y axis. Remember, you are not transforming the data, you are transforming the y axis. This means that you are effectively changing the positions of the horizontal gridlines, but not the values they represent.
Any function that produces a linear transformation will result in fixed spacing between the horizontal grid lines, which is what you have already. The plot therefore won't change.
Let's take a simple example:
plot_data <- data.frame(x = 1:10, y = 1:10)
p <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(breaks = 1:10)
p
Now let's create a straightforward non-linear transformation:
little_trans <- trans_new(name = "transform",
transform = function(x) x^2,
inverse = function(x) sqrt(x))
p + coord_trans(y = little_trans)
Note that the values on the y axis are the same, but because we applied a non-linear transformation, the distances between the gridlines now vary.
In fact, if we plot a transformed version of our data, we would get the same shape:
ggplot(plot_data, aes(x = x, y = y^2)) +
geom_point() +
scale_y_continuous(breaks = (1:10)^2)
In a sense, this is all that the transform does, except it applies the inverse transform to the axis labels. We could do that manually here:
ggplot(plot_data, aes(x = x, y = y^2)) +
geom_point() +
scale_y_continuous(breaks = (1:10)^2, labels = sqrt((1:10)^2))
Now, suppose I instead do a more complicated but linear function of x:
little_trans <- trans_new(name = "transform",
transform = function(x) (0.1 * x + 20) / 3,
inverse = function(x) (x * 3 - 20) / 0.1)
ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
coord_trans(y = little_trans)
It's unchanged from before. We can see why if we again apply our transform directly:
ggplot(plot_data, aes(x = x, y = (0.1 * y + 20) / 3)) +
geom_point() +
scale_y_continuous(breaks = (0.1 * (1:10) + 20) / 3)
Obviously, if we do the inverse transform on the axis labels we will have 1:10, which means we will just have the original plot back.
The same holds true for any linear transform, and therefore the results you are getting are exactly what is to be expected.
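The argument can be checked numerically without drawing anything. This sketch assumes an increasing transform and uses a hypothetical helper, `relative_position()`, that rescales the transformed breaks to the 0 to 1 range, which is roughly what the axis layout does:

```r
# Relative position (0 = bottom, 1 = top) of each break after a transform.
relative_position <- function(breaks, transform) {
  t <- transform(breaks)
  (t - min(t)) / (max(t) - min(t))
}

breaks <- 1:10
identity_pos <- relative_position(breaks, function(x) x)
linear_pos   <- relative_position(breaks, function(x) (0.1 * x + 20) / 3)
square_pos   <- relative_position(breaks, function(x) x^2)

all.equal(identity_pos, linear_pos)          # TRUE: same layout
isTRUE(all.equal(identity_pos, square_pos))  # FALSE: spacing changes
```

Any offset and positive slope cancel out in the rescaling, so only a non-linear transform can move the gridlines relative to each other.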

Replacing geom_rug with geom_boxplot for huge data

I'm trying to visualize a relationship between continuous x and binary y (inspiration)
set.seed(1032490)
NN = 2e5
DF = data.frame(x = rlnorm(NN))
DF$y = as.numeric(DF$x - mean(DF$x) + rnorm(NN) > 0)
ggplot(DF, aes(x, y)) +
stat_smooth(method = 'gam') +
geom_rug(sides = 'b')
Of course with this many points, a rug is not very useful, and it also slows down plotting considerably.
Faster and more interpretable would be to replace geom_rug with a boxplot (or other distribution-summarizing plot).
Is there any out-of-the-box way to do so? I played around with geom_boxplot and checked the documentation to no avail.

You can use geom_boxploth from package ggstance, although I am not sure this is your desired output?
library(ggstance)
ggplot(DF, aes(x, y)) +
stat_smooth(method = 'gam') +
geom_boxploth(aes(y = -1, x = x))
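Newer ggplot2 releases (3.3.0 or later, to my knowledge) let geom_boxplot() draw horizontal boxes itself through the `orientation` argument, so ggstance may not be needed. A sketch under that assumption, with a smaller sample than in the question and arbitrary placement values (y = -0.1, width = 0.1):

```r
set.seed(1032490)
NN <- 2000                      # smaller than the question's 2e5 to keep it quick
DF <- data.frame(x = rlnorm(NN))
DF$y <- as.numeric(DF$x - mean(DF$x) + rnorm(NN) > 0)

# The five values a box plot summarises:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
x_summary <- boxplot.stats(DF$x)$stats

# Plotting is optional and needs ggplot2.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(DF, aes(x, y)) +
    stat_smooth(method = "gam") +
    # One horizontal box placed just below the 0/1 responses.
    geom_boxplot(aes(y = -0.1), orientation = "y", width = 0.1)
}
```

Printing `p` draws the smooth with a single box under it. Unlike a rug, the box costs the same to draw no matter how many points there are.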

R: ggplot distance formula

I'm trying to plot isoclines under a scatterplot using ggplot but I can't figure out how to use stat_function correctly.
The isoclines are based on the distance formula:
sqrt((x1-x2)^2 + (y1-y2)^2)
and would look like concentric circles whose center is the origin of the plot.
What I've tried so far is calling the distance function within ggplot like so (Note: I use x1=1 and y1=1 because in my real problem I also have fixed values)
distance <- function(x, y) {sqrt((x - 1)^2 + (y - 1)^2)}
ggplot(my_data, aes(x, y))+
geom_point()+
stat_function(fun=distance)
but R returns the error:
Computation failed in 'stat_function()': argument "y" is missing, with
no default
How do I correctly feed x and y values to stat_function so that it plots a generic plot of the distance formula, with the center at the origin?

For anything a bit complicated, I avoid the stat functions. They are mostly aimed at quick calculations and are usually limited to calculating y based on x. I would just pre-calculate the data and then plot it with stat_contour instead:
library(ggplot2)
distance <- function(x, y) sqrt((x - 1)^2 + (y - 1)^2)
# Evaluate the distance on a regular grid around the centre (1, 1)
d <- expand.grid(x = seq(0, 2, 0.02), y = seq(0, 2, 0.02))
# distance() is vectorised, so mapply() is not needed
d$dist <- distance(d$x, d$y)
ggplot(d, aes(x, y)) +
  geom_raster(aes(fill = dist), interpolate = TRUE) +
  stat_contour(aes(z = dist), col = 'white') +
  coord_fixed() +
  viridis::scale_fill_viridis(direction = -1)
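If the isoclines are centred on a known point, they are simply circles, so another option is to generate them parametrically and draw them with geom_path(). This is a sketch with hypothetical names (`circle_points`, `radii`) and the origin as the centre; the radii are arbitrary:

```r
# Points on a circle of radius r around (cx, cy).
circle_points <- function(r, cx = 0, cy = 0, n = 200) {
  theta <- seq(0, 2 * pi, length.out = n)
  data.frame(r = r, x = cx + r * cos(theta), y = cy + r * sin(theta))
}

radii <- c(0.5, 1, 1.5, 2)
circles <- do.call(rbind, lapply(radii, circle_points))

# Plotting is optional and needs ggplot2.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(circles, aes(x, y, group = r)) +
    geom_path(colour = "grey50") +
    coord_fixed()   # equal units so circles are not drawn as ellipses
  # To overlay your own points:
  # p + geom_point(data = my_data, aes(x, y), inherit.aes = FALSE)
}
```

This avoids evaluating the distance on a full grid, at the cost of only working when the contours have a known closed form.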

R + ggplot: geom_txt label not recognize a variable in function call

I'm an R/ggplot newbie switching over from MATLAB.
I would like to print a linear regression equation on a ggplot graph (as discussed in Adding Regression Line Equation and R2 on graph). Here, I am trying to wrap that in a function, but I haven't been successful.
I kept getting an error -
"Error in eval(expr, envir, enclos) : object 'label' not found".
One workaround is to define "label" variable outside of the function but I just don't understand why this doesn't work.
Can anyone explain why?
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
f <- function(DS, x, y, z) {
label <- z
print(label)
ggplot(DS, aes(x=x, y=y)) +
geom_point() +
labs(y=y) +
labs(title=y) +
xlim(0,5)+
ylim(0,5)+
geom_smooth(method="lm", se=FALSE)+
geom_text (aes(x=1, y=4, label=label))
}
f(df, x, y, "aaa") #execution line

See the following code:
library(ggplot2)
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
f <- function(DS, x, y, z) {
label.df = data.frame(x=1, y=4, label=z)
ggplot(DS, aes_string(x=x, y=y)) +
geom_point() +
labs(y=y) +
labs(title=y) +
geom_smooth(method="lm", se=FALSE)+
geom_text (aes(x=x, y=y, label=label), label.df)
}
f(df, "x", "y", "aaa")
I made a few fixes to your code:
The data you are using in geom_text is the same you have defined in ggplot() unless you change it. Here I have created a temporary data.frame for this purpose called label.df.
The xlim() and ylim() functions were filtering most of your data points, since the range of x and y are much larger than the limits you defined in the original code.
When you want to pass the names of the columns of your data.frame to be used for displaying the graph it would be easier to pass their names as strings (i.e. "x"). This way, the aes() function is also changed to aes_string().
Here is the result:
Edit
Thanks to @Gregor, a simpler version would be:
f <- function(DS, x, y, z) {
ggplot(DS, aes_string(x=x, y=y)) +
geom_point() +
labs(y=y) +
labs(title=y) +
geom_smooth(method="lm", se=FALSE)+
annotate(geom="text", x=1, y=4, label=z)
}
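aes_string() is deprecated in recent ggplot2 releases. A sketch of the same function using the `.data` pronoun instead, assuming ggplot2 3.0.0 or later; the seed is added only for reproducibility and the label position (x = 1, y = 4) is carried over from the code above:

```r
set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

# x and y are column names given as strings; z is the label text.
f <- function(DS, x, y, z) {
  ggplot(DS, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +
    labs(x = x, y = y, title = y) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    annotate(geom = "text", x = 1, y = 4, label = z)
}

# Building the plot needs ggplot2.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- f(df, "x", "y", "aaa")
}
```

The column names are still passed as strings, so the call site is unchanged from the aes_string() version.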

Plotting a function and a derivative function

I would like to plot a dataframe (X,Y) data together with a fitted function and the derivative of the fitted function.
fit <- lm(data$Y ~ poly(data$X,32,raw=TRUE))
data$fitted_values <- predict(fit, data.frame(x=data$X))
As far as I understand, this gives me a polynomial function of the 32nd degree, fit, that I use to calculate the function values and store them in data$fitted_values. Plotting these series works like a charm with ggplot2.
ggplot(data, aes(x=X)) +
geom_line(aes(y = Y), colour="red") +
geom_line(aes(y = fitted_values), colour="blue")
So far so good. But what I would like to plot too is the first derivative, data$Y', of the fitted function fit. What I'm interested in is the gradient of the fitted function.
My Question: How can I get the derivative function of fit?
I assume I can then "predict" the absolute values for plotting afterwards. Correct?

First, I'll create some test data that "kind of" looks like yours:
set.seed(15)
rr<-density(faithful$eruptions)
dd<-data.frame(x=rr$x)
dd$y=rr$y+ runif(8,0,.05)
fit <- lm(y ~ poly(x,32,raw=TRUE), dd)
dd$fitted <- fitted(fit)
ggplot(dd, aes(x=x)) +
geom_line(aes(y = y), colour="red") +
geom_line(aes(y = fitted), colour="blue")
Then, because you have a special form of a polynomial, we can easily calculate the derivative by multiplying each of the coefficients by its power and shifting all the terms down. Here's a helper function to calculate the new coefficients:
deriv_coef<-function(x) {
x <- coef(x)
stopifnot(names(x)[1]=="(Intercept)")
y <- x[-1]
stopifnot(all(grepl("^poly", names(y))))
px <- as.numeric(gsub("poly\\(.*\\)","",names(y)))
rr <- setNames(c(y * px, 0), names(x))
rr[is.na(rr)] <- 0
rr
}
which we can use like...
dd$slope <- model.matrix(fit) %*% matrix(deriv_coef(fit), ncol=1)
And now I can plot
ggplot(dd, aes(x=x)) +
geom_line(aes(y = y), colour="red") +
geom_line(aes(y = fitted), colour="blue") +
geom_line(aes(y = slope), colour="green")
and we can see that the local maxima and minima of the fitted curve correspond to places where the derivative is zero.
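The coefficient shift is easier to follow on a small polynomial with a known derivative. This sketch does not use the helper above; it repeats the same idea by hand on made-up data from y = 1 + 2x + 3x^2, whose derivative is 2 + 6x:

```r
# Exact data from y = 1 + 2x + 3x^2.
x <- seq(-2, 2, by = 0.5)
y <- 1 + 2 * x + 3 * x^2
fit <- lm(y ~ poly(x, 2, raw = TRUE))

a <- unname(coef(fit))              # a0, a1, a2
powers <- seq_along(a) - 1          # 0, 1, 2
d_coef <- c((a * powers)[-1], 0)    # a1, 2*a2, 0: shifted down one power

# model.matrix(fit) has columns 1, x, x^2, so this evaluates a1 + 2*a2*x.
slope <- as.vector(model.matrix(fit) %*% d_coef)
all.equal(slope, 2 + 6 * x)   # TRUE
```

The trailing zero keeps the coefficient vector the same length as the model matrix has columns, which is why the highest power contributes nothing to the derivative.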

You can approximate the derivative by first sorting the data with respect to X, then finding the differences between each pair of consecutive values.
data <- data[order(data$X), ]
data$derivative <- c(diff(data$fitted_values) / diff(data$X), NA)
(Note how I added an NA to the end, since taking the differences makes it slightly shorter). Afterwards you can plot this:
ggplot(data, aes(X, derivative)) + geom_line()
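To get a feel for how accurate this approximation is, it can be tried on a curve whose derivative is known. A sketch with made-up data (y = x^2 on a grid with spacing 0.01, deliberately stored in reverse order so the sorting step matters); the data frame name `d` is just for this example:

```r
# Known curve so the approximation can be checked: y = x^2, dy/dx = 2x.
d <- data.frame(X = rev(seq(0, 2, by = 0.01)))
d$fitted_values <- d$X^2

# Sort first, then take forward differences; the last row has no successor.
d <- d[order(d$X), ]
d$derivative <- c(diff(d$fitted_values) / diff(d$X), NA)

# Largest deviation from the exact derivative.
err <- abs(d$derivative - 2 * d$X)
max(err, na.rm = TRUE)
```

For this curve the forward difference works out to 2x plus the grid spacing, so the error is about the size of the spacing and shrinks as the X values get denser.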

Allegedly the quantchem package can do it with the derivative function.
Description
Calculate derivative of polynomial for given x.
Usage
derivative(obj, x)
Arguments
obj: an object of class 'lm', fitted in y ~ x + I(x^2) + I(x^3) + ... way.
x: a vector of x values
Examples
x = 1:10
y = jitter(x+x^2)
fit = lm(y~x+I(x^2))
derivative(fit,1:10)
Note: All of this said, it didn't work for me and my data.