R plot, how to start y at non zero?

I have a set of data which goes from: 8e41 to ~ 1e44. I want to plot starting at 1e41 on the y-axis, however using the argument: ylim=c(1e41,1e45) does not work for me.
Here is the minimally reproducible code:
x = c(1,2)
y = c(1e41,1e44)
plot(x,y,ylim=c(1e41,1e45))

The problem is that 1e41 is so much closer to 0 than 1e45 that it's virtually the same. Have you considered working on the log scale?
plot(x,y,ylim=c(1e41,1e45), log = 'y')
or even
plot(x, y, log = 'y')
Think of this another way - rescale your data by dividing your range by 1e41: c(8e41, 1e44)/ 1e41 - you get 8 and 1000. Is there any significant difference between starting the scale at 0 (or 1) versus 8? If you chose to divide by 1e40 instead, you're looking at 80 and 10,000. Try the following code to see this:
m <- 1e41 # change this as desired
plot(x, y / m)
abline(h = c(0, 1e41 / m))
Changing m only changes the numbers on the y-axis; the relative positions do not change. Look at how close the lines at 0 and 1e41 are, and you'll see why it really doesn't matter whether the plot starts at 0 or at 1e41. As a fraction of the total height of the plot, the difference is 1/1000.
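To put numbers on that argument, here is a small sketch that computes where 1e41 falls on a linear axis running from 0 to 1e44 for a few values of m. The names `lower`, `upper`, `frac` and `decades` are introduced here for illustration only.

```r
# Fraction of a linear axis from 0 to `upper` that lies below `lower`,
# computed after rescaling by several different values of m.
lower <- 1e41
upper <- 1e44
frac <- sapply(c(1, 1e40, 1e41), function(m) (lower / m) / (upper / m))
frac      # the same fraction (about 1/1000) regardless of m

# On a log10 axis the same two values are about three units apart.
decades <- log10(upper) - log10(lower)
decades
```

The rescaling cancels out of the ratio, which is why dividing by m cannot move the points relative to each other; only the log scale spreads them out.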
Changing the values at which the axis is labeled
Here's one more option for you - changing the values at which the plot is marked. That requires two steps - first, removing the axis labels when the plot is originally created, then adding in the ones you actually want:
plot(x, y, yaxt = 'n')
axis(2, c(1e41, seq(1e43, 1e44, 1e43)))
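If the default tick labels are hard to read, the same two-step idea can be combined with explicit label text. This is only a sketch; the tick spacing and the names `ticks` and `tick_labels` are arbitrary choices of mine.

```r
x <- c(1, 2)
y <- c(1e41, 1e44)

# Tick positions: the lower bound plus every 2e43 up to 1e44
ticks <- c(1e41, seq(2e43, 1e44, by = 2e43))
tick_labels <- format(ticks, scientific = TRUE)

plot(x, y, yaxt = "n")             # draw the plot without a y-axis
axis(2, at = ticks, labels = tick_labels,
     las = 1, cex.axis = 0.7)      # horizontal, slightly smaller labels
```

Note that on a linear scale the tick at 1e41 still sits almost on top of where 0 would be.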

library(ggplot2)
x = c(1,2)
y = c(1e41,1e44)
data = data.frame(x,y)
ggplot(data, aes(x=x, y=log(y))) + geom_point() + ylim(90,150)
I think you should use the log of y, as it shows the same data.
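A possible variation, if you would rather keep the axis labelled in the original units: let ggplot2 apply the log transformation through the scale instead of transforming y yourself. This is a sketch under the assumption that a log10 axis is acceptable; the name `plt` is mine.

```r
library(ggplot2)

data <- data.frame(x = c(1, 2), y = c(1e41, 1e44))

# The scale does the log10 transform, so limits are given in data units
plt <- ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  scale_y_log10(limits = c(1e41, 1e45))
plt
```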

Related

How to fix unstable y-positions for geom_jitter() for ggplot2 in R?

I'm making a common R ggplot2 graph: boxplots supplemented with the individual samples drawn as points by geom_jitter(), to show the individual sample positions and numbers in each group. Normally I have not noticed a problem, but with some recent data, I've noticed substantial inaccuracy and variation in the y position of the jitter. However, the boxplot stays stable with respect to y, and so does geom_point() when used to show the same points that the jitter is plotting. The error is likely not noticeable when you have many data points, but with only 5-10 samples in a group it can produce an obvious error and a plot that may mislead you if you are not aware of the issue.
I first thought this may have always happened and I didn't notice, so I made some random numbers and made a ggplot with geom_jitter(), but at first the problem disappeared. Some example data and plots are given to show the normal and problematic cases.
Data generation and plotting that worked as expected:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 30))
check the plot:
library(ggplot2)
ggplot(df, aes(X, Y)) + geom_boxplot() + geom_jitter(col = "red") + geom_point(col = "blue")
The red and blue dots are almost exactly aligned, and you can just watch the plot come in RStudio preview if you repeat the code 5 times and not notice variation in the jitter point y-position (only horizontally along the X-axis, as expected). In a problematic case like below, you quickly see the y-axis point variation, especially because it sometimes shifts the range of the y-axis.
With more variation in random numbers, I found a difference visible between the red and blue points, which varied each time of plotting the same data:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))
The actual numbers to get this problem were:
X Y
1 X 610.78026
2 X -38.58905
3 X -196.00943
4 X 94.37797
5 X 415.58417
In my result, the lowest point, -196, sometimes was about -170, sometimes about -250. The range of the y-axis shifts each time. It's similar to the problem I had happen with real data.
Further testing showed that more variance, or a larger range between points, did not explain the variability of the jitter y-position. In some cases with more variance, geom_jitter() again produced near perfect y-positions. So I wondered if it may have something to do with how ggplot2 maps points onto certain plot areas. I tried to test that by forcing ggplot to keep the same y limits using ylim(-206, 621), but that did not stop the shifting in the problematic case above. It gives a mysterious, yet consistent warning: "Warning message: Removed 1 rows containing missing values (geom_point)."
(In the corresponding plot, it lost the red jitter point for the 610.7 value, despite enough pixel space in the plot preview window for about 10 more points between the blue point and the top of the graph. In another attempt, 2 jitter points get lost, because the bottom sometimes goes past the lower limit).
A roundabout solution would be to make random points for the X group, all keeping the same Y and group identity, but it's not efficient. When non-numerical groups are used on X, I found it will have a numerical position of 1 for any labels being added. Adding the following to the last dataframe gives the proper appearance + geom_point(aes(x= rnorm(5, 1, .2), y = Y), col = "yellow") - but that would become quite cumbersome if there are many groups if there is not some way to automatically get the correct X position for groups of boxplots.
To solve the problem, any input on what the cause of it is would be a great help.
It sounds like you do not want the default geom_jitter behavior, which adds a uniformly distributed amount of noise separately to the x and y value before plotting, by default "40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins."
For a continuous variable like yours "resolution" is "the smallest non-zero distance between adjacent values."
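Applied to the five values from the question, that definition can be worked through by hand. The sketch below uses base R rather than ggplot2's own helper, so treat it as an approximation of what the package does internally; `res` and `amount` are names I made up.

```r
Y <- c(610.78026, -38.58905, -196.00943, 94.37797, 415.58417)

# "Resolution": smallest non-zero gap between adjacent sorted values
res <- min(diff(sort(unique(Y))))

# Default jitter amount: 40% of the resolution, in each direction
amount <- 0.4 * res

res       # about 133
amount    # about 53

# Range in which the lowest point (-196) can end up after jittering
range(min(Y) - amount, min(Y) + amount)
```

The last line gives roughly -249 to -143, which is consistent with the lowest point appearing anywhere between about -170 and -250 in the question.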
Try this:
geom_jitter(col = "red", height = 0) +
That will tell ggplot you want no noise applied to the y values before plotting.
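Here is a complete version using the data posted in the question, as a sketch. The `width = 0.2` value is an arbitrary choice of mine, and `plt` is just a name for the plot object.

```r
library(ggplot2)

df <- data.frame(X = rep("X", 5),
                 Y = c(610.78026, -38.58905, -196.00943, 94.37797, 415.58417))

plt <- ggplot(df, aes(X, Y)) +
  geom_boxplot() +
  geom_jitter(col = "red", height = 0, width = 0.2) +  # horizontal noise only
  geom_point(col = "blue")
plt

# Compare the y values actually drawn by the jitter layer (layer 2)
# with the original data
all.equal(as.numeric(layer_data(plt, 2)$y), df$Y)
```

With no vertical noise, the y-axis range should also stop shifting between redraws, since it is computed from the same values every time.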
Another approach would be to add noise yourself before the plotting step, giving you the ability to control its distribution and range specifically.
e.g. instead of having the jittering fill a uniform rectangle: ...
library(dplyr)
tibble(x = rep(1:2, each = 1000),
y = rep(3:4, each = 1000)) -> point_data
ggplot(point_data, aes(x,y)) + geom_jitter()
We could add whatever noise function we want. Here, for no particular reason, I make donuts around the real data, and compare that to the default jitter:
point_data %>%
mutate(angle = runif(2000, 0, 2*pi),
dist = rnorm(2000, 0.3, 0.05),
x2 = x + dist*cos(angle),
y2 = y + dist*sin(angle)) %>%
ggplot() +
geom_jitter(aes(x,y), color = "red", alpha = 0.2) +
geom_point(aes(x2,y2))

How do I use the rug function for an exponential distribution in R?

I am just starting out in R. I want to plot interval of times (the distribution is exponential) on x axis, with a tick mark in place every time the interval ends. If I have a string of times say (0.2, 0.8, 0.9 , 1.0) then the tick marks on the x axis would be on 0.2, 0.8, 0.9 and 1 respectively. With big data samples, I want my graph to look something like:
So after using,
set.seed(1)
x <- rexp(50, 0.2)
How can I go about it further, might I have to use rug function (which I am trying to learn how to use)? Can I also put time stamps on this graph?
Edit
So I have modified my command and used:
x <- c(cumsum(rexp(50, 0.2)))
y <- rep(0, length(x))
plot(x,y)
rug(x)
and I have been able to get this:
This result does the work, if it's just about that. However, is there a line of command I can use to edit this result as shown in the second picture, and get a result as shown in the first picture? I would like to just get these tick marks on a horizontal line instead of the whole plot. Or it's not possible?
With ggplot you can use geom_rug() to add a rug to the plot. First the data need to be made into a data.frame.
library("tidyverse")
set.seed(1)
x <- rexp(50, 0.2)
ggplot(data.frame(x), aes(x = x)) + geom_rug()
The rug is rather short by default (it is drawn as a proportion of the graph height). Recent versions of ggplot2 let you change this with the length argument of geom_rug().
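For longer ticks, a sketch assuming ggplot2 3.2.0 or later, where geom_rug() accepts a `length` argument; the value of half the panel height is an arbitrary choice, and `plt` is my own name.

```r
library(ggplot2)

set.seed(1)
x <- rexp(50, 0.2)

# `length` sets the tick height; "npc" units are fractions of the panel
plt <- ggplot(data.frame(x), aes(x = x)) +
  geom_rug(length = grid::unit(0.5, "npc"))
plt
```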
The opposite would be to use geom_vline which will give lines the full length of the y-axis
#ggplot(data.frame(x), aes(xintercept = x)) + geom_vline() #doesn't work
ggplot(data.frame(x)) + geom_vline(aes(xintercept = x))
rug() requires only a vector of values that describes where to draw the tickmarks (rugs). In case of plotting values x on the x-axis, those form the input data for the rug function. Type ?rug to get further help.
# generate y values
y <- rexp(50, 0.2)
# split plotting area into two columns - optional
par(mfrow = c(1, 2))
plot(x, y)
rug(x)
# plot with both axes in log scale to show that rug adjusts to axes scale
plot(x, y, log = "xy")
rug(x)
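To get only tick marks on a horizontal line, as asked in the question's edit, one option in base graphics is to draw an empty plot and add the line and ticks yourself. This is a sketch, not the only way to do it; the tick height of 0.2 and the axis label are arbitrary.

```r
set.seed(1)
x <- cumsum(rexp(50, 0.2))          # event times, as in the question's edit

# Empty plot: no points, no y-axis, no surrounding box
plot(x, rep(0, length(x)), type = "n", yaxt = "n", ylab = "",
     xlab = "time", bty = "n", ylim = c(-1, 1))
abline(h = 0)                       # the horizontal time line
segments(x, -0.2, x, 0.2)           # one vertical tick per event time
```

Time stamps could be added with text() at the same x positions, although with 50 events the labels will likely overlap.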

Scale huge axis in R for plotting

I have a huge file I load into one vector
y = scan("my_file")
My x axis is also really huge, let's say it is in the range x=1:5000000
My question is now how can I scale my plot so that I can actually see something? So far I am doing the following
UPDATE:
plot(x, y, log="x", pch=".")
However, the logarithm alone is not enough. Can I somehow scale the x more, like taking a sqrt or something, and if yes, how? Sorry, this may be a simple question, but I am really new to R.
I am not sure how to add a file, but the file I am using to load into vector y contains, as I said, only 5 million entries of 0, 1 or 2, so:
y=c(1,0,1,....................)
the x axis as I mentioned above.
The second thing I tried was:
zerotwo <- data.frame(x, y)
ggplot(aes(x, y, fill=as.factor(y)), data=zerotwo) + geom_tile() + scale_x_continuous(trans='log2') + geom_tile()
But here the fill=as.factor doesn't do its job either.
Another possibility is to rely on color-coding to encode the value, and use the y-axis to put the data into different rows. 5m tiles on the x-axis is a bit much for ggplot, but 50 rows of 100k work OK if the plot size is large enough. Read the plot from left to right, then top to bottom.
# create test data
zerototwo <- data.frame(position=1:5000000, value=sample(0:2, 5000000, replace=TRUE))
# for your data: zerototwo <- data.frame(position=1:length(y), value=y)
zerototwo$row <- floor((zerototwo$position -1)/100000)
zerototwo$rowpos <- (zerototwo$position - 1) %% 100000
ggplot(aes(x=rowpos, y=row, fill=as.factor(value)), data=zerototwo) +
geom_tile(height=0.9) + scale_y_reverse()
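A different technique from the tile plot above, offered only as a sketch: aggregate the 5 million values into bins and plot the proportion of 0, 1 and 2 per bin. The bin size of 100k mirrors the row length used above; `value`, `bin`, `counts` and `props` are names I chose, and the random data is a stand-in for the real file.

```r
set.seed(1)
n <- 5000000
value <- sample(0:2, n, replace = TRUE)     # stand-in for the real y
bin <- (seq_len(n) - 1) %/% 100000          # 50 bins of 100k positions each

counts <- table(bin, value)                 # 50 x 3 table of counts
props <- unclass(prop.table(counts, margin = 1))  # row-wise proportions

# One line per value, one x position per bin
matplot(props, type = "l", lty = 1, col = 1:3,
        xlab = "bin (100k positions each)", ylab = "proportion")
legend("topright", legend = colnames(props), col = 1:3, lty = 1)
```

This trades positional detail for readability: 50 x positions instead of 5 million, at the cost of hiding patterns shorter than one bin.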

How to deal with zero in log plot

The Problem
I have data that I would like to plot in a line graph with a log scale on the y-axis using ggplot2. Unfortunately, some of my values go all the way down to zero. The data represents relative occurrences of a feature as a function of some parameters. The value zero occurs when that feature is not observed in a sample, which means that it occurs very seldom, or indeed never. These zero values cause a problem in the log plot.
The following code illustrates the problem on a simplified data set. In reality the data set consists of more points, so the curve looks smoother, and also more values for the parameter p.
library(ggplot2)
dat <- data.frame(x=rep(c(0, 1, 2, 3), 2),
y=c(1e0, 1e-1, 1e-4, 0,
1e-1, 1e-3, 0, 0),
p=c(rep('a', 4), rep('b', 4)))
qplot(data=dat, x=x, y=y, colour=p, log="y", geom=c("line", "point"))
Given the data above, we would expect two lines, the first one should have three finite points on a log plot, the second one should have only two finite points on a log plot.
However, as you can see this produces a very misleading plot. It looks like the blue and red line are both converging to a value between 1e-4 and 1e-3. The reason is that log(0) gives -Inf, which ggplot just puts on the lower axis.
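The effect is easy to check on the first series from the data above (a minimal illustration; `y_a` and `log_y` are my own names):

```r
y_a <- c(1e0, 1e-1, 1e-4, 0)   # series 'a' from the example data
log_y <- log10(y_a)
log_y                          # values: 0, -1, -4, -Inf
is.finite(log_y)               # only the last entry is not finite
```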
My Question
What's the best way to deal with this in R with ggplot2? By best I mean in terms of efficiency, and being idiomatic R (I'm fairly new to R).
The plot should indicate that these curves go down to "very small" after x=2 (red), or x=1 (blue), respectively. Ideally, with a vertical line downwards from the last finite point. What I mean by that is demonstrated in the following.
My Attempt
Here I'll describe what I've come up with. However, given that I'm fairly new to R, I suspect that there might be a much better way.
library(ggplot2)
library(scales)
dat <- data.frame(x=rep(c(0, 1, 2, 3), 2),
y=c(1e0, 1e-1, 1e-4, 0,
1e-1, 1e-3, 0, 0),
p=c(rep('a', 4), rep('b', 4)))
Same data as above.
Now, I'm going through each unique parameter p, find the x coordinate of the last finite point, and assign it to the x coordinates of all points where y is zero. That is to achieve a vertical line.
for (p in unique(dat$p)) {
dat$x[dat$p == p & dat$y == 0] <- dat$x[head(which(dat$p == p & dat$y == 0), 1) - 1]
}
At this point the plot looks as follows.
The vertical lines are there. However, there are also points. These are misleading as they indicate that there was an actual data point there, which is not true.
To remove the points I duplicate the y data (seems wasteful), let's call it yp, and replace zero by NA. Then I use that new yp as the y aesthetics for geom_point.
dat$yp <- dat$y
dat$yp[dat$y == 0] <- NA
ggplot(dat, aes(x=x, y=y, colour=p)) +
geom_line() +
geom_point(aes(y=yp)) +
scale_y_continuous(trans=log10_trans(),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
Where I've used ggplot instead of qplot so that I can give different aesthetics to geom_line and geom_point.
Finally, the plot looks like this.
What is the right way to do this?
For me, I use
+ scale_y_continuous(trans=scales::pseudo_log_trans(base = 10))
If you're using ggplot, you can use scales::pseudo_log_trans() as your transformation object. It maps zero to 0 on the transformed scale instead of -Inf.
From the docs (https://scales.r-lib.org/reference/pseudo_log_trans.html),
A transformation mapping numbers to a signed logarithmic scale with a smooth transition to linear scale around 0.
pseudo_log_trans(sigma = 1, base = exp(1))
For example, my scale expression looks like this:
+ scale_fill_gradient(name = "n occurrences", trans="pseudo_log")
Unconfirmed, but you probably need to include the scales library:
require("scales")
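Applied to the data from the question, this might look like the sketch below. One caveat: with the default sigma = 1 the scale is close to linear for values below 1, so I have assumed a much smaller sigma here; 1e-5 is an arbitrary choice just below the smallest non-zero value, and the break positions are likewise mine.

```r
library(ggplot2)

dat <- data.frame(x = rep(c(0, 1, 2, 3), 2),
                  y = c(1e0, 1e-1, 1e-4, 0,
                        1e-1, 1e-3, 0, 0),
                  p = c(rep("a", 4), rep("b", 4)))

# sigma sets where the scale changes from linear to logarithmic
tr <- scales::pseudo_log_trans(sigma = 1e-5, base = 10)

plt <- ggplot(dat, aes(x = x, y = y, colour = p)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans = tr,
                     breaks = c(0, 1e-4, 1e-3, 1e-2, 1e-1, 1))
plt

tr$transform(0)   # zero gets a finite position on this scale
```

Unlike the vertical-line approach in the question, the zeros are drawn as ordinary points at the bottom of the axis, so the plot still needs a note that they mean "not observed".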
The simplest way would be to add a small value to each of the numbers. Example,
library(dplyr)  # provides mutate()
df <- mutate(df, log_var = log(var + 0.01))
ggplot(df, aes(x = log_var)) + geom_histogram()

How can I set axis ranges in ggplot2 when using a log scale?

I have a time series of data where the measurements are all integers between 1e6 and 1e8: website hits per month. I want to use ggplot2 to chart these with points and lines, but mapping the measurements to a log scale. Something like this:
qplot(month, hits, data=hits.per.month, log="y")
When I do that, ggplot seems to set the scale from 1e6 to 1e8. I want it to scale from 0 to 1e8. The natural way of doing this seems to have no effect on the output:
qplot(month, hits, data=hits.per.month, log="y", ylim=c(0, 100000000))
I can get the picture I want by transforming hits before it reaches qplot, but that changes the labels on the axis:
qplot(month, log10(hits), data=hits.per.month, log="y", ylim=c(0, 8))
I also tried various combinations with scale_y_log10, but had no luck.
So, how do I set the Y axis range when using a log scale in ggplot2?
Much of ggplot2 is simply clearer to me if one doesn't use qplot. That way you aren't cramming everything into a single function call:
df <- data.frame(x = 1:10,
y = seq(1e6,1e8,length.out = 10))
ggplot(data = df,aes(x = x, y =y)) +
geom_point() +
scale_y_log10(limits = c(1,1e8))
I'm going to assume you didn't really mean a y axis minimum of 0, since on a log scale that, um, is problematic.
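A quick check of why a lower limit of 1 behaves and 0 does not (the variable names are mine):

```r
axis_span <- log10(c(1, 1e8))     # 0 and 8: the axis covers eight decades
data_span <- log10(c(1e6, 1e8))   # 6 and 8: the data fill only the top two
zero_pos  <- log10(0)             # -Inf: zero has no position on a log axis

axis_span
data_span
zero_pos
```

So with limits = c(1, 1e8) the points end up in the top quarter of the panel, which may or may not be the picture you were after.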

Resources