The Problem
I have data that I would like to plot as a line graph with a log scale on the y-axis using ggplot2. Unfortunately, some of my values go all the way down to zero. The data represents relative occurrences of a feature as a function of some parameters. A value of zero means that the feature was not observed in a sample, i.e. it occurs very rarely or indeed never. These zero values cause a problem in the log plot.
The following code illustrates the problem on a simplified data set. In reality the data set has more points, so the curves look smoother, and more values for the parameter p.
library(ggplot2)
dat <- data.frame(x = rep(c(0, 1, 2, 3), 2),
                  y = c(1e0, 1e-1, 1e-4, 0,
                        1e-1, 1e-3, 0, 0),
                  p = c(rep('a', 4), rep('b', 4)))
qplot(data=dat, x=x, y=y, colour=p, log="y", geom=c("line", "point"))
Given the data above, we would expect two lines: the first should have three finite points on a log plot, the second only two.
However, as you can see, this produces a very misleading plot. It looks like the blue and red lines both converge to a value between 1e-4 and 1e-3. The reason is that log(0) gives -Inf, which ggplot simply places on the lower axis.
My Question
What's the best way to deal with this in R with ggplot2? By best I mean efficient and idiomatic R (I'm fairly new to R).
The plot should indicate that these curves go down to "very small" after x=2 (red) or x=1 (blue), respectively; ideally with a vertical line going down from the last finite point. What I mean by that is demonstrated below.
My Attempt
Here I'll describe what I've come up with. However, given that I'm fairly new to R, I suspect that there might be a much better way.
library(ggplot2)
library(scales)
dat <- data.frame(x = rep(c(0, 1, 2, 3), 2),
                  y = c(1e0, 1e-1, 1e-4, 0,
                        1e-1, 1e-3, 0, 0),
                  p = c(rep('a', 4), rep('b', 4)))
Same data as above.
Now I go through each unique parameter p, find the x coordinate of the last finite point, and assign it to the x coordinates of all points where y is zero. This is what produces the vertical line.
# For each parameter p, move every zero-y point to the x coordinate of the
# last finite point (the row just before the first zero)
for (p in unique(dat$p)) {
  dat$x[dat$p == p & dat$y == 0] <-
    dat$x[head(which(dat$p == p & dat$y == 0), 1) - 1]
}
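A tidier alternative (just a sketch on my part; it relies on the same assumption that x is increasing within each group and produces the same modified dat) would be to do the per-group replacement with dplyr:
library(dplyr)
# for each p, move the zero-y points to the x of the last non-zero point
dat <- dat %>%
  group_by(p) %>%
  mutate(x = ifelse(y == 0, max(x[y > 0]), x)) %>%
  ungroup()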
At this point the plot looks as follows.
The vertical lines are there. However, there are also points. These are misleading as they indicate that there was an actual data point there, which is not true.
To remove the points I duplicate the y data (which seems wasteful) into a new column, let's call it yp, and replace zeros with NA. Then I use yp as the y aesthetic for geom_point.
dat$yp <- dat$y
dat$yp[dat$y == 0] <- NA
ggplot(dat, aes(x = x, y = y, colour = p)) +
  geom_line() +
  geom_point(aes(y = yp)) +
  scale_y_continuous(trans = log10_trans(),
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)))
Here I've used ggplot instead of qplot so that I can give different aesthetics to geom_line and geom_point.
Finally, the plot looks like this.
What is the right way to do this?
For me, I use
+ scale_y_continuous(trans=scales::pseudo_log_trans(base = 10))
If you're using ggplot, you can use scales::pseudo_log_trans() as your transformation object. This will replace your -Inf with 0.
From the docs (https://scales.r-lib.org/reference/pseudo_log_trans.html),
A transformation mapping numbers to a signed logarithmic scale with a smooth transition to linear scale around 0.
pseudo_log_trans(sigma = 1, base = exp(1))
For example, my scale expression looks like this:
+ scale_fill_gradient(name = "n occurrences", trans="pseudo_log")
Unconfirmed, but you probably need to include the scales library:
require("scales")
The simplest way would be to add a small value to each of the numbers. Example,
library(dplyr)    # for mutate()
library(ggplot2)
df <- mutate(df, log_var = log(var + 0.01))
ggplot(df, aes(x = log_var)) + geom_histogram()
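If the hard-coded 0.01 feels arbitrary, a hedged variation is to derive the offset from the data, for example half the smallest non-zero value:
# a sketch: tie the offset to the data instead of hard-coding it
eps <- min(df$var[df$var > 0]) / 2
df <- mutate(df, log_var = log(var + eps))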
Related
I'm making a common R ggplot2 graph: boxplots supplemented with the individual samples as points drawn by geom_jitter(), to show the individual sample positions and numbers in each group. Normally I have not noticed a problem, but with some recent data I've noticed substantial inaccuracy and variation in the y position of the jitter. The boxplot, however, stays stable with respect to y, and so does geom_point() when used to show the same points that jitter is plotting. The error is probably not noticeable with many data points, but with 5-10 samples in a group it can produce an obvious error and a plot that may mislead you if you are not aware of the issue.
I first thought this might have always happened and I simply hadn't noticed, so I generated some random numbers and made a ggplot with geom_jitter(), but at first I could not reproduce the problem. Example data and plots are given below to show the normal and the problematic case.
Data generation and plotting that worked as expected:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 30))
check the plot:
library(ggplot2)
ggplot(df, aes(X, Y)) + geom_boxplot() + geom_jitter(col = "red") + geom_point(col = "blue")
The red and blue dots are almost exactly aligned; you can repeat the code five times while watching the plot in the RStudio preview and not notice any variation in the jitter points' y positions (only horizontally along the x-axis, as expected). In a problematic case like the one below, you quickly see the y-axis variation, especially because it sometimes shifts the range of the y-axis.
With more variation in the random numbers, I found a visible difference between the red and blue points, which changed each time the same data was plotted:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))
The actual numbers to get this problem were:
X Y
1 X 610.78026
2 X -38.58905
3 X -196.00943
4 X 94.37797
5 X 415.58417
In my result the lowest point, -196, was sometimes plotted at about -170 and sometimes at about -250. The range of the y-axis shifts each time. This is similar to the problem I had with my real data.
Further testing showed that more variance, or a larger range between points, did not by itself explain when the jitter y-position varied; in some cases with more variance geom_jitter() again produced near-perfect y positions. So I wondered whether it has something to do with how ggplot2 maps certain plot areas. I tried to test that by forcing ggplot to keep the same y limits with ylim(-206, 621), but it did not fix the problematic case above. It gives a mysterious yet consistent message: "Warning message: Removed 1 rows containing missing values (geom_point)."
(In the corresponding plot, the red jitter point for the 610.7 value was lost, despite enough pixel space in the plot preview window for about 10 more points between the blue point and the top of the graph. In another attempt, 2 jitter points were lost, because the bottom sometimes goes past the lower limit.)
A roundabout solution would be to generate random x positions for the group myself, keeping the same Y and group identity, but that is not efficient. When a non-numerical group is used on X, I found it gets a numerical position of 1 for any labels being added. Adding the following to the last data frame gives the proper appearance: + geom_point(aes(x = rnorm(5, 1, .2), y = Y), col = "yellow") - but that would become quite cumbersome with many groups unless there is some way to automatically get the correct x position for each group of boxplots.
Any input on what is causing this would be a great help in solving the problem.
It sounds like you do not want the default geom_jitter behavior, which adds a uniformly distributed amount of noise separately to the x and y value before plotting, by default "40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins."
For a continuous variable like yours "resolution" is "the smallest non-zero distance between adjacent values."
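You can inspect those amounts directly with ggplot2's resolution() helper (a quick check, mirroring the 40% default quoted above):
# smallest non-zero gap between adjacent Y values, and the implied default jitter
ggplot2::resolution(df$Y, zero = FALSE)
ggplot2::resolution(df$Y, zero = FALSE) * 0.4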
Try this:
geom_jitter(col = "red", height = 0) +
That will tell ggplot you want no noise applied to the y values before plotting.
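Applied to the code from the question, with nothing else changed (a sketch):
library(ggplot2)
ggplot(df, aes(X, Y)) +
  geom_boxplot() +
  geom_jitter(col = "red", height = 0) +  # horizontal jitter only
  geom_point(col = "blue")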
Another approach would be to add noise yourself before the plotting step, giving you the ability to control its distribution and range specifically.
e.g. instead of having the jittering fill a uniform rectangle: ...
library(dplyr)
library(ggplot2)
tibble(x = rep(1:2, each = 1000),
       y = rep(3:4, each = 1000)) -> point_data
ggplot(point_data, aes(x, y)) + geom_jitter()
We could add whatever noise function we want. Here, for no particular reason, I make donuts around the real data, and compare that to the default jitter:
point_data %>%
  mutate(angle = runif(2000, 0, 2*pi),
         dist = rnorm(2000, 0.3, 0.05),
         x2 = x + dist*cos(angle),
         y2 = y + dist*sin(angle)) %>%
  ggplot() +
  geom_jitter(aes(x, y), color = "red", alpha = 0.2) +
  geom_point(aes(x2, y2))
I am attempting to place individual points on a plot using ggplot2; however, as there are many points, it is difficult to gauge how densely packed they are. Here, two factors are being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
library(ggplot2)
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# one label per value of s1 and s2 (1000 each)
data = data.frame(task_number = as.factor(c(replicate(1000, 1),
                                            replicate(1000, 2))),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image (borrowed from https://slowkow.com/notes/ggplot2-color-by-density/), but with one dimension rather than two:
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
  group_by(task_number) %>%
  # Use approxfun to interpolate the density back to
  # the original points
  mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
  geom_point() +
  scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
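For example, with the data from the question (the alpha value is just a starting point to tune):
ggplot(data, aes(x = task_number, y = S)) +
  geom_point(alpha = 0.03)  # overlapping points accumulate into darker regions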
I have a set of data which goes from 8e41 to ~1e44. I want the plot to start at 1e41 on the y-axis; however, using the argument ylim=c(1e41,1e45) does not work for me.
Here is the minimally reproducible code:
x = c(1,2)
y = c(1e41,1e44)
plot(x,y,ylim=c(1e41,1e45))
The problem is that 1e41 is so much closer to 0 than 1e45 that it's virtually the same. Have you considered working on the log scale?
plot(x,y,ylim=c(1e41,1e45), log = 'y')
or even
plot(x, y, log = 'y')
Think of this another way - rescale your data by dividing your range by 1e41: c(8e41, 1e44) / 1e41 gives you 8 and 1000. Is there any significant difference between starting the scale at 0 (or 1) versus 8? If you choose to divide by 1e40 instead, you're looking at 80 and 10,000. Try the following code to see this:
m <- 1e41 # change this as desired
plot(x, y / m)
abline(h = c(0, 1e41 / m))
By changing m, the only thing that changes is the numbers on the y-axis; the relative positions do not change. Look at how close 0 and 8e41 are, and you'll see why it really doesn't matter whether the plot starts at 0 or at 1e41. As a fraction of the total height of the plot, the difference is 1/1000.
Changing the values at which the axis is labeled
Here's one more option for you - changing the values at which the plot is marked. That requires two steps - first, removing the axis labels when the plot is originally created, then adding in the ones you actually want:
plot(x, y, yaxt = 'none')
axis(2, c(1e41, seq(1e43, 1e44, 1e43)))
library(ggplot2)
x = c(1,2)
y = c(1e41,1e44)
data = data.frame(x,y)
ggplot(data, aes(x=x, y=log(y))) + geom_point() + ylim(90,150)
I think you should use the log of y, as it shows the same data.
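If you would rather keep the original values and axis labels than plot log(y), a hedged alternative is to put the log10 transformation on the scale itself:
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  scale_y_log10(limits = c(1e41, 1e45))  # axis keeps the 1e41-1e45 labels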
I am just starting out in R. I want to plot intervals of time (the distribution is exponential) on the x-axis, with a tick mark placed every time an interval ends. If I have a string of times, say (0.2, 0.8, 0.9, 1.0), then the tick marks on the x-axis would be at 0.2, 0.8, 0.9 and 1 respectively. With big data samples, I want my graph to look something like:
So after using,
set.seed(1)
x <- rexp(50, 0.2)
How can I go about this further; do I need to use the rug function (which I am trying to learn how to use)? Can I also put time stamps on this graph?
Edit
So I have modified my command and used:
x <- c(cumsum(rexp(50, 0.2)))
y <- rep(0, length(x))
plot(x,y)
rug(x)
and I have been able to get this:
This does the job as far as it goes. However, is there a command I can use to edit this result (shown in the second picture) so that it looks like the first picture? I would like just the tick marks on a horizontal line instead of the whole scatter plot. Or is that not possible?
With ggplot you can use geom_rug() to add a rug to the plot. First the data need to be put into a data.frame:
library("tidyverse")
set.seed(1)
x <- rexp(50, 0.2)
ggplot(data.frame(x), aes(x = x)) + geom_rug()
The rug is rather short (it seems to be a proportion of the graph height and not controllable).
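(In more recent ggplot2 versions - 3.2.0 or later, if I recall correctly - geom_rug() does take a length argument; a sketch, assuming such a version:)
library(ggplot2)
library(grid)  # for unit()
# longer rug marks: 10% of the panel height instead of the default 3%
ggplot(data.frame(x), aes(x = x)) +
  geom_rug(length = unit(0.1, "npc"))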
The opposite would be to use geom_vline, which will give lines the full length of the y-axis:
#ggplot(data.frame(x), aes(xintercept = x)) + geom_vline() #doesn't work
ggplot(data.frame(x)) + geom_vline(aes(xintercept = x))
rug() only requires a vector of values describing where to draw the tick marks (rugs). When plotting the values x on the x-axis, those same values form the input to the rug function. Type ?rug for further help.
# generate y values
y <- rexp(50, 0.2)
# split plotting area into two columns - optional
par(mfrow = c(1, 2))
plot(x, y)
rug(x)
# plot with both axes in log scale to show that rug adjusts to axes scale
plot(x, y, log = "xy")
rug(x)
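If, as in the question, you only want the tick marks along a horizontal line rather than a full scatter plot, one hedged base-R sketch is to draw an empty frame and add the rug to it:
set.seed(1)
x <- cumsum(rexp(50, 0.2))
# empty frame: no points, no y-axis, no box; then rug marks along the bottom
plot(x, rep(0, length(x)), type = "n", yaxt = "n", ylab = "", bty = "n")
rug(x)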
I've written an R script that loops through a data.frame making multiple complex plots that include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1, and it obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below - but what I really need to figure out is how to define the y-axis limits such that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE,
                              prob = c(0.8, 0.01, 0.01, 0.01, 0.01,
                                       0.01, 0.01, 0.01, 0.01, 0.01)),
                       nrow = 100))
cols = names(df)
for (i in c(1:length(cols))) {
  my_col = cols[i]
  p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
  print(p1)
  p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) +
    scale_x_continuous(limits = c(1, 10))
  print(p2)
  p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) +
    scale_y_continuous(limits = c(0, 3))
  print(p3)
  p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) +
    scale_y_continuous(limits = c(0, 3)) +
    scale_x_continuous(limits = c(1, 10))
  print(p4)
}
In this simulated data I can hard-code y-limits and reasonably expect them to work well for all the histograms. With my real data, however, the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various formulas based on descriptive statistics like the mean, median and range, but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
# compute each column's histogram without drawing it
maxDensities <- sapply(df, function(i) max(hist(i, plot = FALSE)$density))
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]
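Note that geom_histogram() plots counts by default while hist()$density gives densities, and the hist() breakpoints will not exactly match ggplot's bins; a hedged variant that stays in counts and takes the second-tallest bin within a single column could look like:
# a sketch: second-tallest bin count for one column, used as that plot's y limit
second_peak <- function(v, bins = 10) {
  sort(hist(v, breaks = bins, plot = FALSE)$counts, decreasing = TRUE)[2]
}
p3 <- p1 + scale_y_continuous(limits = c(0, second_peak(df[[my_col]]) * 1.05))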
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out
cut will bin the data (not really needed with integer data like yours, but probably needed with real data)
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
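To actually feed that value into one of the plots, a hedged sketch for a single column might be:
# 2nd-highest bin count for column X1, used (with a little headroom) as the y limit
ylim2 <- as.numeric(sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2])
ggplot(df, aes(X1)) +
  geom_histogram(bins = 10) +
  scale_y_continuous(limits = c(0, ylim2 * 1.05))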