ggplot boxplot on log scale, mean via stat_summary appears wrong [duplicate] - r

I have a bunch of measurements over time and I want to plot them in R. Here is a sample of my data. I've got 6 measurements for each of 4 time points:
values <- c (1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5,
2236.9, 1697.5, 2087.7, 1500.8,
2789.3, 1502.9, 2051.3, 3070.7, 3105.4,
2692.5, 1488.5, 1978.1, 1925.4, 1524.3,
2772.0, 1355.3, 2632.4, 2600.1)
time <- factor (rep (c(0, 12, 24, 72), c(6, 6, 6, 6)))
The scale of these data is arbitrary, and in fact I'm going to normalize it so that the average of t=0 is 1.
norm <- values / mean (values[time == 0])
So far so good. Using ggplot, I plot both the individual points, as well as a line that goes through the average at each time point:
require (ggplot2)
p <- ggplot(data = data.frame(time, norm), mapping = aes (x = time, y = norm)) +
stat_summary (fun.y = mean, geom="line", mapping = aes (group = 1)) +
geom_point()
However, now I want to apply a logarithmic scale, and this is where my trouble starts. When I do:
q <- ggplot(data = data.frame(time, norm), mapping = aes (x = time, y = norm)) +
stat_summary (fun.y = mean, geom="line", mapping = aes (group = 1)) +
geom_point() +
scale_y_log2()
The line does NOT go through 0 at t=0, as you would expect because log (1) == 0. Instead the line crosses the y-axis slightly below 0. Apparently, ggplot applies the mean after log transformation, which gives a different result. I want it to take the mean before log transformation.
How can I tell ggplot to apply the mean first? Is there a better way to create this chart?
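The discrepancy can be reproduced outside ggplot with plain arithmetic. The following is a minimal sketch using only the six t=0 values from the question; the names t0, norm0, log_of_mean and mean_of_log are introduced here for illustration.

```r
# The six t = 0 measurements from the question, normalized to mean 1.
t0 <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5)
norm0 <- t0 / mean(t0)

# Mean first, then log: numerically 0, because mean(norm0) is 1.
log_of_mean <- log2(mean(norm0))

# Log first, then mean: this is what a transformed scale hands to
# stat_summary. It is the log of the geometric mean, which is never
# larger than the arithmetic mean, so the result is below 0 here.
mean_of_log <- mean(log2(norm0))

print(c(log_of_mean = log_of_mean, mean_of_log = mean_of_log))
```

The gap between the two numbers is the offset seen where the line meets t=0 in the log-scaled plot.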

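scale_y_log2() transforms the data first and then computes the stats and geoms on the transformed values.
coord_trans() does the opposite: it computes the stats and geoms first, and then transforms the axis.
So you need coord_trans(ytrans = "log2") instead of scale_y_log2(). (In current ggplot2 releases the argument is named y, i.e. coord_trans(y = "log2"), and scale_y_log2() has been replaced by scale_y_continuous(trans = "log2").)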
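A side-by-side sketch of the two approaches, assuming a ggplot2 version (3.3.0 or later) in which stat_summary() takes fun and coord_trans() takes y. The object names d, base, p_scale and p_coord are made up for this example, and layer_data() is used only to read back the y value that each plot computes for the summary line.

```r
library(ggplot2)

values <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5,
            2236.9, 1697.5, 2087.7, 1500.8,
            2789.3, 1502.9, 2051.3, 3070.7, 3105.4,
            2692.5, 1488.5, 1978.1, 1925.4, 1524.3,
            2772.0, 1355.3, 2632.4, 2600.1)
time <- factor(rep(c(0, 12, 24, 72), each = 6))
d <- data.frame(time, norm = values / mean(values[time == 0]))

base <- ggplot(d, aes(x = time, y = norm)) +
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  geom_point()

# Scale transformation: the data are log2-transformed before the mean is taken.
p_scale <- base + scale_y_continuous(trans = "log2")

# Coordinate transformation: the mean is taken on the raw data,
# and only the drawing happens on a log2 axis.
p_coord <- base + coord_trans(y = "log2")

# y value of the summary line at t = 0, as computed by each plot.
y_scale <- layer_data(p_scale, 1)$y[1]   # mean of log2(norm), below 0
y_coord <- layer_data(p_coord, 1)$y[1]   # mean of norm, i.e. 1 (drawn at log2(1) = 0)
```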

A workaround, if you don't want to use coord_trans() and still want to transform the data, is to write a summary function that back-transforms the values, takes the mean, and transforms the result again. The base has to match the scale, so for scale_y_log2():
f1 <- function(x) {
  log2(mean(2 ^ x))
}
stat_summary (fun.y = f1, geom="line", mapping = aes (group = 1))
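The back-transforming idea can be checked in base R without drawing anything. This is a small sketch for a log2 scale; the function name mean_on_raw_scale and the vector raw are hypothetical and chosen so that the arithmetic mean is exactly 1.

```r
# Summary function for a log2-transformed y scale: undo the
# transformation, average on the original scale, transform back.
mean_on_raw_scale <- function(x) log2(mean(2^x))

raw <- c(0.5, 1, 1.5)      # made-up values whose mean is 1
logged <- log2(raw)        # what stat_summary sees under a log2 scale

plain_mean <- mean(logged)                # mean of the logs, below 0
fixed_mean <- mean_on_raw_scale(logged)   # log2(mean(raw)), numerically 0

print(c(plain_mean = plain_mean, fixed_mean = fixed_mean))
```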

The best solution I found for this issue was to combine coord_trans() with scale_y_continuous(breaks = breaks).
As previously suggested, coord_trans() scales the axis without transforming the data, but on its own it leaves you with an ugly axis.
Setting the limits in coord_trans() works for some things, but if you want the axis to show specific labels, add scale_y_continuous() with the breaks you want. Here breaks is a numeric vector of tick positions in the original (untransformed) units:
coord_trans(y = 'log10') +
  scale_y_continuous(breaks = breaks)
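Since the answer leaves breaks undefined, here is one possible complete call. The data frame d and the tick positions in breaks are made up for illustration, and the sketch assumes a ggplot2 version in which coord_trans() accepts the y argument.

```r
library(ggplot2)

d <- data.frame(x = 1:5, y = c(1, 3, 10, 30, 100))   # made-up data
breaks <- c(1, 10, 100)   # hypothetical tick positions, in data units

p <- ggplot(d, aes(x, y)) +
  geom_point() +
  coord_trans(y = "log10") +              # log10 spacing, data left untouched
  scale_y_continuous(breaks = breaks)     # ticks only where requested

built <- ggplot_build(p)   # builds the plot without drawing it
```

Because the scale itself stays linear, the breaks are given in the original units rather than as exponents.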

Related

Points keep getting cut off, and standard fixes don't work well with facet grid on a log scale

Novice R user here wrestling with some arcane details of ggplot
I am trying to produce a plot that charts two data ranges: One plotted as a line, and another plotted on the same plot, but as points. The code is something roughly like this:
ggplot(data1, aes(x = Year, y = Capacity, col = Process)) +
geom_line() +
facet_grid(Country ~ ., scales = "free_y") +
scale_y_continuous(trans = "log10") +
geom_point(data = data2, aes(x = Year, y = Capacity, col = Process))
I've left out some additional cosmetic arguments for the sake of simplicity.
The problem is that the points from the geom_point keep getting cut off by the x axis:
I know the standard fix here would be to adjust the y limits to make room for the points:
scale_y_continuous(limits = c(-100, Y_MAX))
But here there is a separate problem due to the facet grid with free scales, since there is no single value for Y_MAX
I've also tried it using expansions:
scale_y_continuous(expand = c(0.5, 0))
But here, it runs into problems with the log scale, since it multiplies by different values for each facet, producing very wonky results.
I just want to produce enough blank space on the bottom of each facet to make room for the point. Or, alternatively, move each point up a little bit to make room. Is there any easy way to do this in my case?
This might be a good place for scales::pseudo_log_trans, which combines a log transformation with a linear transformation (and a flipped sign log transformation) to retain most of the benefits of a log transformation while also allowing zero and negative values. Adjust the sigma parameter of the function to adjust where the transition from linear to log should happen.
library(ggplot2)
ggplot(data = data.frame(country = rep(c("France","USA"), each = 5),
x = rep(1:5, times = 2),
y = c(10^(2:6), 0, 10^(1:4))),
aes(x,y)) +
geom_point() +
# scale_y_continuous(trans = "log10") +
scale_y_continuous(trans = scales::pseudo_log_trans(),
breaks = c(0, 10^(0:6)),
labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
facet_wrap(~country, ncol = 1, scales = "free_y")
vs. with (trans = "log10"):
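To see why the pseudo-log transformation copes with zero, its values can be compared with plain log10 directly. This is a sketch with hypothetical settings (base 10 and the default sigma = 1), using the transform and inverse functions carried by the object that scales::pseudo_log_trans() returns.

```r
library(scales)

# Hypothetical settings: base 10 to match the axis, default sigma = 1.
tr <- pseudo_log_trans(sigma = 1, base = 10)

x <- c(0, 1, 10, 1000, 1e6)
pseudo <- tr$transform(x)   # finite everywhere, 0 maps to 0
plain  <- log10(x)          # -Inf at x = 0

print(data.frame(x, pseudo, plain))

# The transformation is invertible, so axis labels can stay in data units.
round_trip <- tr$inverse(pseudo)
```

For large values the two columns agree closely, and near zero the pseudo-log column stays finite, which is what leaves room for points at or near 0.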

R - ggplot2 change x-axis values to non-log values

I am plotting some payment distribution information and I aggregated the data after scaling it to log-normal (base-e). The histograms turn out great but I want to modify the x-axis to display the non-log equivalents.
My current axis displays [0:2.5:10] values
Alternatively, I would like to see values for exp(2.5), exp(5), etc.
Any suggestions on how to accomplish this? Anything I can add to my plotting statement to scale the x-axis values? Maybe there's a better approach - thoughts?
Current code:
ggplot(plotData, aes_string(pay, fill = pt)) + geom_histogram(bins = 50) + facet_wrap(~M_P)
Answered...Final plot:
Not sure if this is exactly what you are after but you can change the text of the x axis labels to whatever you want using scale_x_continuous.
Here's without:
ggplot(data = cars) + geom_histogram(aes(x = speed), binwidth = 1)
Here's with:
ggplot(data = cars) + geom_histogram(aes(x = speed), binwidth = 1) +
scale_x_continuous(breaks=c(5,10,15,20,25), labels=c(exp(5), exp(10), exp(15), exp(20), exp(25)))
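Instead of typing the labels by hand, scale_x_continuous() also accepts a function for labels, which lets ggplot choose the breaks and converts each one back to original units. A sketch under stated assumptions: plot_data is a made-up stand-in for the log-scaled payments in the question, and unlog_label is a hypothetical helper name.

```r
library(ggplot2)

set.seed(1)
# Made-up stand-in for payments that were already log-transformed (base e).
plot_data <- data.frame(pay = log(rlnorm(500, meanlog = 5, sdlog = 1.5)))

# Turn a break on the log scale back into original units, rounded to
# two significant digits and printed without scientific notation.
unlog_label <- function(b) {
  format(signif(exp(b), 2), big.mark = ",", scientific = FALSE, trim = TRUE)
}

p <- ggplot(plot_data, aes(pay)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(labels = unlog_label)
```

The bars stay where they were; only the text on the axis changes.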

How to change origin line position in ggplot bar graph?

Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want while relabeling the ticks to match the original range and relabeling the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retained ggplot's tick placement, I defined a tiny function and passed it to the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()

Interpolating correctly between points in R using ggplot2 and axis scaling

I have some data I want to graph on a semi-log scale, however I get some artifacts when there is a large jump between points. On linear scale, a straight line is drawn between subsequent points, which is a fine approximation for visualization. However, the exact same thing is done when using the log scale (either by using scale_x_log10 or scale_x_continuous with a log transformation). A line between two points on the semi-log scale should show up curved. In other words, this:
df <- data.frame(x = c(0, 1), y = c(0, 1))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
produces this:
when I would expect something more like this:
generated by this code:
df <- data.frame(x = seq(0, 1, 0.01), y = seq(0, 1, 0.01))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
It's clear what's happening, but I'm not sure what the best way to fix the interpolation is. In the actual data I'm plotting there are a few jumps at various points, which makes the plots very misleading when trying to compare two lines. (They're ROC curves in this instance.)
One thought is I can search the data for jumps and fill in some interpolated points myself, but I'm hoping for a cleaner way that doesn't involve me adding in a bunch of fake data points.
What you describe is a transformation of the coordinate system, not a transformation of the scales. The distinction is that scale transformations take place before any statistical transformations, and coordinate transformations take place afterward. In this case, the "statistical transformation" is "draw a straight line between the points". With a transformed scale, the line is straight in the transformed (log) space; with a transformed coordinate, it is straight in the original (linear) space and therefore curved in log space.
# don't include 0 in the data because log 0 is -Inf
DF <- data.frame(x = c(0.1, 1), y = c(0.1, 1))
ggplot(data = DF, aes(x = x, y = y)) +
geom_line() +
coord_trans(x="log10")
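The difference between the two kinds of interpolation can be put into numbers with base R. This sketch takes the two endpoints from the code above and asks what y value each approach draws halfway across the log axis; the variable names are introduced here for illustration.

```r
# Two endpoints, as in the example above.
x <- c(0.1, 1)
y <- c(0.1, 1)

# The point halfway between the endpoints on a log10 axis.
x_mid <- 10^mean(log10(x))   # geometric midpoint, about 0.316

# Straight in linear space (coord_trans): y follows x on this segment.
y_coord <- approx(x, y, xout = x_mid)$y

# Straight in log space (scale_x_log10): interpolate against log10(x).
y_scale <- approx(log10(x), y, xout = log10(x_mid))$y

print(c(x_mid = x_mid, y_coord = y_coord, y_scale = y_scale))
```

At the same horizontal position the transformed scale draws the line at 0.55, while the coordinate transformation keeps it at about 0.316, which is why the latter appears curved on the log axis.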
