I have a bunch of measurements over time and I want to plot them in R. Here is a sample of my data. I've got 6 measurements for each of 4 time points:
values <- c (1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5,
2236.9, 1697.5, 2087.7, 1500.8,
2789.3, 1502.9, 2051.3, 3070.7, 3105.4,
2692.5, 1488.5, 1978.1, 1925.4, 1524.3,
2772.0, 1355.3, 2632.4, 2600.1)
time <- factor (rep (c(0, 12, 24, 72), c(6, 6, 6, 6)))
The scale of these data is arbitrary, and in fact I'm going to normalize it so that the average of t=0 is 1.
norm <- values / mean (values[time == 0])
So far so good. Using ggplot, I plot both the individual points, as well as a line that goes through the average at each time point:
require (ggplot2)
p <- ggplot(data = data.frame(time, norm), mapping = aes (x = time, y = norm)) +
stat_summary (fun.y = mean, geom="line", mapping = aes (group = 1)) +
geom_point()
However, now I want to apply a logarithmic scale, and this is where my trouble starts. When I do:
q <- ggplot(data = data.frame(time, norm), mapping = aes (x = time, y = norm)) +
stat_summary (fun.y = mean, geom="line", mapping = aes (group = 1)) +
geom_point() +
scale_y_log2()
The line does NOT go through 0 at t=0, as you would expect because log (1) == 0. Instead the line crosses the y-axis slightly below 0. Apparently, ggplot applies the mean after log transformation, which gives a different result. I want it to take the mean before log transformation.
How can I tell ggplot to apply the mean first? Is there a better way to create this chart?
scale_y_log2() does the transformation first and then calculates the geoms.
coord_trans() does the opposite: it calculates the geoms first and then transforms the axis.
So you need coord_trans(y = "log2") instead of scale_y_log2(). (Older ggplot2 releases spelled this argument ytrans.)
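The difference between the two orders can be checked with plain arithmetic. This is only a sketch using the six t = 0 values from the question; the variable names are made up for illustration:

```r
# The six t = 0 measurements from the question
t0 <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5)
norm0 <- t0 / mean(t0)             # normalised so the arithmetic mean is 1

log_of_mean  <- log2(mean(norm0))  # mean first, then log (what coord_trans() shows)
mean_of_logs <- mean(log2(norm0))  # log first, then mean (what a log scale shows)

log_of_mean    # zero, up to floating-point error
mean_of_logs   # strictly below zero
```

The mean of the logs is the log of the geometric mean, which is never larger than the arithmetic mean, so the line on a log scale sits slightly below 0 at t = 0.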
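A workaround, if you don't want to use coord_trans() and still want to transform the data, is to write a summary function that back-transforms the values, takes the mean, and transforms the result again:
f1 <- function(x) {
# x arrives already log-transformed; the base must match the scale
# (10 here, as for a log10 scale; use 2 for a log2 scale)
log10(mean(10 ^ x))
}
stat_summary (fun.y = f1, geom="line", mapping = aes (group = 1))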
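As a quick sanity check of the workaround (a sketch only, reusing the t = 0 values from the question and assuming a base-10 scale), the back-transforming function returns the log of the arithmetic mean, whereas a plain mean of the logged values comes out lower:

```r
# Back-transform, average, then transform again (base 10 to match a log10 scale)
f1 <- function(x) log10(mean(10 ^ x))

raw    <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5)  # t = 0 values
logged <- log10(raw)   # what the summary function receives under a log10 scale

f1(logged)         # log of the arithmetic mean
log10(mean(raw))   # the same value, computed directly
mean(logged)       # plain mean of the logs: smaller
```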
The best solution I found for this issue was to use a combo of coord_trans() and scale_y_continuous(breaks = breaks)
As previously suggested, using coord_trans will scale your axis without transforming the data; however, it will leave you with an ugly axis.
Setting the limits in coord_trans works for some things, but if you want to fix your axis to have specific labels, you will then include scale_y_continuous with the breaks you'd like set.
coord_trans(y = 'log10') +
scale_y_continuous(breaks = breaks)
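Put together with the data from the question, this might look roughly like the following. It is a sketch, not a definitive version: the breaks vector is a hypothetical choice made for this data range, and fun is used in place of fun.y, which is the argument name in ggplot2 3.3.0 and later.

```r
library(ggplot2)

values <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5,
            2236.9, 1697.5, 2087.7, 1500.8, 2789.3, 1502.9,
            2051.3, 3070.7, 3105.4, 2692.5, 1488.5, 1978.1,
            1925.4, 1524.3, 2772.0, 1355.3, 2632.4, 2600.1)
time <- factor(rep(c(0, 12, 24, 72), each = 6))
norm <- values / mean(values[time == 0])

# Hypothetical tick positions, picked by eye for this data range
breaks <- c(0.75, 1, 1.5, 2, 2.5)

p <- ggplot(data.frame(time, norm), aes(x = time, y = norm)) +
  stat_summary(fun = mean, geom = "line", mapping = aes(group = 1)) +
  geom_point() +
  scale_y_continuous(breaks = breaks) +  # linear scale: only sets the ticks
  coord_trans(y = "log2")                # log transform applied after the mean
p
```

Because the scale stays linear, the summary is computed on the untransformed values, so the line passes through 1 at t = 0.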
Novice R user here, wrestling with some arcane details of ggplot.
I am trying to produce a plot that charts two data ranges: One plotted as a line, and another plotted on the same plot, but as points. The code is something roughly like this:
ggplot(data1, aes(x = Year, y = Capacity, col = Process)) +
geom_line() +
facet_grid(Country ~ ., scales = "free_y") +
scale_y_continuous(trans = "log10") +
geom_point(data = data2, aes(x = Year, y = Capacity, col = Process))
I've left out some additional cosmetic arguments for the sake of simplicity.
The problem is that the points from the geom_point keep getting cut off by the x axis:
I know the standard fix here would be to adjust the y limits to make room for the points:
scale_y_continuous(limits = c(-100, Y_MAX))
But here there is a separate problem due to the facet grid with free scales, since there is no single value for Y_MAX.
I've also tried it using expansions:
scale_y_continuous(expand = c(0.5, 0))
But here, it runs into problems with the log scale, since it multiplies by different values for each facet, producing very wonky results.
I just want to produce enough blank space on the bottom of each facet to make room for the point. Or, alternatively, move each point up a little bit to make room. Is there any easy way to do this in my case?
This might be a good place for scales::pseudo_log_trans, which combines a log transformation with a linear transformation (and a flipped sign log transformation) to retain most of the benefits of a log transformation while also allowing zero and negative values. Adjust the sigma parameter of the function to adjust where the transition from linear to log should happen.
library(ggplot2)
ggplot(data = data.frame(country = rep(c("France","USA"), each = 5),
x = rep(1:5, times = 2),
y = c(10^(2:6), 0, 10^(1:4))),
aes(x,y)) +
geom_point() +
# scale_y_continuous(trans = "log10") +
scale_y_continuous(trans = scales::pseudo_log_trans(),
breaks = c(0, 10^(0:6)),
labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
facet_wrap(~country, ncol = 1, scales = "free_y")
vs. with (trans = "log10"):
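The difference can also be checked numerically. A minimal sketch, assuming base 10 and sigma = 1 purely for illustration (the plot above uses the function's defaults):

```r
# Pseudo-log transform: roughly linear near zero, roughly log10 for large values
tr <- scales::pseudo_log_trans(sigma = 1, base = 10)
y  <- c(0, 1, 10, 100, 1e6)

tr$transform(y)   # finite everywhere, including at y = 0
log10(y)          # -Inf at y = 0, so that point cannot be placed on the axis
```

For large values the two transforms agree closely, which is why the upper part of the axis looks like an ordinary log axis.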
I am trying to plot a dbscan clustering result with ggplot2. If I understand correctly, dbscan currently plots noise points in black using the base plot function. Some code first:
library(dbscan)
n <- 100
x <- cbind(
x = runif(5, 0, 10) + rnorm(n, sd = 0.2),
y = runif(5, 0, 10) + rnorm(n, sd = 0.2)
)
plot(x)
kNNdistplot(x, k = 5)
abline(h=.25, col = "red", lty=2)
res <- dbscan::dbscan(x, eps = .25, minPts = 4)
plot(res, x, main = "DBSCAN")
x <- data.frame(x)
ggplot(x, aes(x = x, y=y)) + geom_point(color = res$cluster+1, pch = clusym[res$cluster+1]) +
theme_grey() + ggtitle("(c)") + labs(x ="x", y = "y")
I want to do two things differently here. First, I want to plot the clustering output through ggplot(). The difficulty is that if I use res$cluster to plot the points, plot() ignores points labelled 0 (the noise points), whereas ggplot() throws an error because the length of res$cluster is smaller than the data to plot; and if I use res$cluster+1, the noise points get label 1, which I don't want. Second, if possible, I'd like to do what clusym[] in the fpc package does: it plots clusters with the labels 1, 2, 3, ... and ignores 0 labels. That would be fine if the noise points keep label 0 and are then given a specific symbol, say "*", in a specific colour, say grey. I have seen a Stack Overflow post that does something similar for convex hull plotting, but I still couldn't figure out how to do this when I don't want to draw the hull and want a cluster number for each cluster.
One possibility I thought of was to first plot the points without noise and then add the noise points, with the desired colour and symbol, to the original plot.
But since the length of res$cluster is not equal to that of x, it throws an error.
ggplot(x, aes(x = x, y=y)) + geom_point(color = res$cluster+1, pch = clusym[res$cluster+1]) +
theme_grey() + ggtitle("(c)") + labs(x ="x", y = "y") # + adding noise points
Error: Aesthetics must be either length 1 or the same as the data (100): shape, colour
You should first extract the cluster assignments from the DBSCAN output (res$cluster in the question's code), tack them onto your original data as a new column (i.e. as cluster), and convert that column to a factor.
When you make the ggplot, you can assign color or shape to cluster. As for ignoring the noise points, I would do it as follows.
data <- data.frame(x, cluster = res$cluster) # cluster column still in numeric form
data2 <- dplyr::filter(data, cluster > 0)
data2$cluster <- as.factor(data2$cluster)
ggplot(data2, aes(x = x, y = y)) +
geom_point(aes(color = `cluster`))
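If the noise points should stay on the plot as grey asterisks rather than be dropped, one option is to split the data and add a second layer. The sketch below uses a hypothetical stand-in data frame with random labels so that it runs on its own; with the question's objects you would build df as data.frame(x, cluster = res$cluster) instead.

```r
library(ggplot2)

# Hypothetical stand-in for the clustered data: label 0 marks noise
set.seed(1)
df <- data.frame(x = runif(60, 0, 10),
                 y = runif(60, 0, 10),
                 cluster = sample(0:3, 60, replace = TRUE))

clustered <- subset(df, cluster > 0)
noise     <- subset(df, cluster == 0)
clustered$cluster <- factor(clustered$cluster)

p <- ggplot(clustered, aes(x = x, y = y)) +
  # draw each clustered point as its cluster number, coloured by cluster
  geom_text(aes(label = cluster, colour = cluster), show.legend = FALSE) +
  # noise points in a separate layer: grey, shape 8 (asterisk)
  geom_point(data = noise, colour = "grey50", shape = 8)
p
```

Using geom_text() for the clustered points is one way to imitate what clusym does in base graphics; geom_point(aes(colour = cluster)) works just as well if the numbers are not needed.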
Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want while relabeling the ticks to match the original range and relabeling the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retained ggplot's tick placement, I defined a tiny function and called this function in the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()
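If the origin itself might change between data sets, the same idea can be wrapped in a small function factory. This is just a sketch; offset_labels is a made-up name, not something from ggplot2 or scales:

```r
# Hypothetical helper: returns a labelling function for a given origin
offset_labels <- function(origin) {
  function(x) x + origin
}

# The shifted axis positions map back to the original percentile values
offset_labels(50)(c(-50, -25, 0, 25, 50))
```

It would then be used as scale_y_continuous(labels = offset_labels(50)), with the same origin subtracted inside aes().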
I have some data I want to graph on a semi-log scale, however I get some artifacts when there is a large jump between points. On linear scale, a straight line is drawn between subsequent points, which is a fine approximation for visualization. However, the exact same thing is done when using the log scale (either by using scale_x_log10 or scale_x_continuous with a log transformation). A line between two points on the semi-log scale should show up curved. In other words, this:
df <- data.frame(x = c(0, 1), y = c(0, 1))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
produces this:
when I would expect something more like this:
generated by this code:
df <- data.frame(x = seq(0, 1, 0.01), y = seq(0, 1, 0.01))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
It's clear what's happening, but I'm not sure what the best way to fix the interpolation is. In the actual data I'm plotting there are a few jumps at various points, which makes the plots very misleading when trying to compare two lines. (They're ROC curves in this instance.)
One thought is I can search the data for jumps and fill in some interpolated points myself, but I'm hoping for a cleaner way that doesn't involve me adding in a bunch of fake data points.
What you describe is a transformation of the coordinate system, not a transformation of the scales. The distinction is that scale transformations take place before any statistical transformations, and coordinate transformations take place afterward. In this case, the "statistical transformation" is "draw a straight line between the points". With a transformed scale, the line is straight in the transformed (log) space; with a transformed coordinate, it is straight in the original (linear) space and therefore curved in log space.
# don't include 0 in the data because log 0 is -Inf
DF <- data.frame(x = c(0.1, 1), y = c(0.1, 1))
ggplot(data = DF, aes(x = x, y = y)) +
geom_line() +
coord_trans(x="log10")
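To see why the segment comes out curved, it may help to compute where points along the straight line land on a log10 axis. A small base-R sketch (the variable names are only for illustration):

```r
# Points along the straight segment from (0.1, 0.1) to (1, 1) in linear space
x <- seq(0.1, 1, length.out = 5)
y <- x

# Horizontal position of each point on a log10 axis
screen_x <- log10(x)

# Slope between consecutive points as drawn; a straight line would give a
# constant slope, but here it grows from left to right
slopes <- diff(y) / diff(screen_x)
slopes
```

coord_trans() does this kind of interpolation along the segment for you, so no extra data points need to be added by hand.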