Changing xlim automatically changes ylim for geom_density - r

I think my problem is best explained by an example:
set.seed(12)
n <- 100
x <- rt(n, 1, 0)
library("ggplot2")
p <- ggplot() + geom_density(aes(x))
p
p + xlim(min(x), 300)
[Figures: the density plot with the default xlim, and with the new xlim]
Why does the y axis automatically change when I change xlim? The density should not change, so it does not make sense to me. When I use base plot this does not happen.
plot(density(x))
plot(density(x), xlim = c(min(x), 300))

Using xlim completely drops observations that are outside of the range. Try using p + coord_cartesian(xlim = c(min(x), 300)).

Use geom_density(..., n=2^16) or similar for a more stable experience.
It would appear that in contrast to density, the function geom_density does take the x range set via xlim into account when deciding at which points to evaluate the density estimation. However, the number of such points remains fixed at 512 (unless using n to set it to a higher value). Hence the larger the x range, the more likely some peaks will be missed. I think this should be documented.
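The effect of the fixed grid can be checked without ggplot2. This is a minimal sketch, assuming that geom_density ends up evaluating stats::density() on n points spanning the x scale limits (an assumption about its internals, not something shown above). The names coarse and fine are made up for illustration.

```r
set.seed(12)
x <- rt(100, 1, 0)

# Same data and same x range, but different numbers of evaluation points.
coarse <- density(x, from = min(x), to = 300, n = 512)
fine   <- density(x, from = min(x), to = 300, n = 2^16)

# Spacing between neighbouring evaluation points.
step_coarse <- diff(coarse$x[1:2])
step_fine   <- diff(fine$x[1:2])

# Compare the grid spacing with the bandwidth, and the two peak heights.
c(bw = coarse$bw, step_coarse = step_coarse, step_fine = step_fine)
c(peak_coarse = max(coarse$y), peak_fine = max(fine$y))
```

If the coarse spacing is comparable to or larger than the bandwidth, a narrow peak can fall between evaluation points, which would explain why the plotted maximum changes with xlim.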

Related

How to get a "perfect " Y-axis for hist(nclass=nclass.scott)?

Occasionally hist(..., nclass=nclass.scott) produces a histogram where the maximum bar extends over the top of the y axis. You may try this example a few times:
x <- sample(1000000, 500, replace=TRUE)
h <- hist(x,nclass=nclass.scott)
text(x=h$mids, y=h$counts, labels=h$counts, pos=3, col="red")
Occasionally the red number over the highest bar cannot be presented as it seems to be clipped by the plot region. I could add ylim=..., but it's quite tricky to get the maximum height of the bar.
Even when the maximum height is known, ylim=c(0, max) has the problem that max may appear to be ignored: for example, when the maximum is 527, the highest y-axis label displayed is 500, even if ylim=c(0, 527) is specified. Using 600 instead works, but then the y-axis is a bit too long...
If that is not a bug of R (3.3.3), what is an elegant (minimalistic) solution?
I think you need to set par(xpd = TRUE) in your graph to avoid the clipping.
?par
xpd
A logical value or NA. If FALSE, all plotting is clipped to the
plot region, if TRUE, all plotting is clipped to the figure region,
and if NA, all plotting is clipped to the device region. See also
clip.
You can do better by combining the usr option with xpd. On closer inspection, the bars are not actually leaving the chart; rather, the axis stops at its last label. To fix the labels, we can use usr to place a tick at the top of the plotting region. If you also want to adjust the margins, you can use mar.
library(RColorBrewer)
par(mfrow=c(1,1),xpd=T,yaxs="i")
x <- sample(1000000, 500, replace=TRUE)
h <- hist(x,nclass=nclass.scott,axes=FALSE,col=brewer.pal(10,"Set3"))
# usr <- par("usr")
at <- c(0, 10,30, par("usr")[4])
axis(2,at=at,labels = round(at))
text(x=h$mids, y=h$counts, labels=h$counts, pos=3, col="red")
usr
A vector of the form c(x1, x2, y1, y2) giving the extremes of the
user coordinates of the plotting region. When a logarithmic scale is
in use (i.e., par("xlog") is true, see below), then the x-limits will
be 10 ^ par("usr")[1:2]. Similarly for the y-axis.
You may want to run it several times. I have run it many times, and the bars no longer seem to go outside the chart.
What you describe is not a bug. You are using functionality to draw a histogram and then you want to add text to it. The function has not been designed for that, hence you need to reserve some additional white space for the text.
I suggest you run the function once, to get the "base values" of the graph. Then run the function again with adjusted scale (extra space for the text). In order to achieve this, you could use the following code
set.seed(9876) ### for reproducibility
x <- sample(1000000, 500, replace = TRUE)
h <- hist(x, nclass = nclass.scott, plot = FALSE)
### use the info from the previous call to adjust the y-scale with a constant
hist(x, nclass = nclass.scott, ylim = c(0, max(h$counts) + 10))
text(x = h$mids, y = h$counts, labels = h$counts, pos = 3, col = "red")
### ... or add a proportion (a little bit more robust)
hist(x, nclass = nclass.scott, ylim = c(0, max(h$counts) * 1.075))
text(x = h$mids, y = h$counts, labels = h$counts, pos = 3, col = "red")
Please let me know whether this is what you want.

How to fit R histogram within axes limits [0,1]

Suppose I generate data using x <- rnorm(10000) and then plot a simple histogram using hist(x).
This obviously shows that the data is normal, but the x and y axes are determined by the values generated. How could I adjust x so that the histogram will still appear as a normal curve, but on a plot whose bounds are x=[0,1] and y=[0,1]. I tried using this normalization method from another answer, https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range, and setting xlim and ylim to c(0,1), but the result was not what I wanted, as it basically just fills up the entire plot.
I'm not sure what you mean by 'fills up the whole plot'. This code seems to work fine:
x <- rnorm(1000)
z <- (x - min(x))/(max(x) - min(x))
hist(z)
Then if you want the y-axis on a scale of 0-1:
hist1 <- hist(z)
hist1$counts <- hist1$counts/sum(hist1$counts)
plot(hist1, ylim = c(0,1)) ## Looks squished to me if you include the ylim argument

log-transformed density function not plotting correctly

I'm trying to log-transform the x axis of a density plot and get unexpected results. The code without the transformation works fine:
library(ggplot2)
data = data.frame(x=c(1,2,10,11,1000))
dens = density(data$x)
densy = sapply(data$x, function(x) { dens$y[findInterval(x, dens$x)] })
ggplot(data, aes(x = x)) +
geom_density() +
geom_point(y = densy)
If I add scale_x_log10(), I get the following result:
Apart from the y values having been rescaled, something seems to have happened to the x values as well -- the peaks of the density function are not quite where the points are.
Am I using the log transformation incorrectly here?
The shape of the density curve changes after the transformation because the distribution of the data has changed and the bandwidths are different. If you set a bandwidth of (bw=1000) prior to the transformation and 10 afterward, you will get two normal looking densities (with different y-axis values because the support will be much larger in the first case). Here is an example showing how varying bandwidths change the shape of the density.
data = data.frame(x=c(1,2,10,11,1000), y=0)
## Examine how changing bandwidth changes the shape of the curve
par(mfrow=c(2,1))
greys <- colorRampPalette(c("black", "red"))(10)
plot(density(data$x), main="No Transform")
points(data, pch=19)
plot(density(log10(data$x)), ylim=c(0,2), main="Log-transform w/ varying bw")
points(log10(data$x), data$y, pch=19)
for (i in 1:10) {
  points(density(log10(data$x), bw=0.02*i), col=greys[i], type="l")
}
## solid lines in the legend, matching the curves drawn above
legend("topright", paste(0.02*1:10), col=greys, lty=1, cex=0.8)
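The difference in bandwidths can also be inspected directly. This is a small sketch, assuming both plots use the default bandwidth rule of density() (bw.nrd0), and that scale_x_log10() causes the density to be estimated on the log10 values. bw_raw and bw_log are illustrative names.

```r
x <- c(1, 2, 10, 11, 1000)

bw_raw <- bw.nrd0(x)          # default bandwidth on the original scale
bw_log <- bw.nrd0(log10(x))   # default bandwidth on the log10 scale

c(raw = bw_raw, log10 = bw_log)

# density() reports the bandwidth it actually used in its bw component.
density(x)$bw
density(log10(x))$bw
```

The two bandwidths live on different scales, so the smoothing applied around each point is not comparable between the plots, and the curve is not simply the original one redrawn on a log axis.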

How are trellis axis limits calculated?

Say I want to create an ordinary xyplot without explicitly specifying axis limits, then how are axis limits calculated?
The following line of code produces a simple scatter plot. However, axis limits do not exactly range from 1 to 10, but are slightly expanded to the left and right and top and bottom sides (roughly by 0.5).
library(lattice)
xyplot(1:10 ~ 1:10, cex = 1.5, pch = 20, col = "black",
xlab = "x", ylab = "y")
Is there any way to determine the factor by which the axes were expanded on each side, e.g. using trellis.par.get? I already tried the following after executing the above-mentioned xyplot command:
library(grid)
downViewport(trellis.vpname(name = "figure"))
current.panel.limits()
$xlim
[1] 0 1
$ylim
[1] 0 1
Unfortunately, the panel limits are returned as normalized parent coordinates, which makes it impossible to obtain the "real" limits. Any suggestions would be highly appreciated!
Update:
Using base-R plot, the data range (and consequently the axis limits) is by default extended by 4% on each side, see ?par. But this factor doesn't seem to apply to 'trellis' objects. So what I am looking for is an analogue to the 'xaxs' (and 'yaxs') argument implemented in par.
Axis limits for xyplot are calculated in the extend.limits function. This function isn't exported from the lattice package, so to see it, type lattice:::extend.limits. Concerning a numeric vector, this function is passed the range of values from the corresponding data (c(1, 10) in this example). The final limits are calculated according to the following equation:
lim + prop * d * c(-1, 1)
lim are the limits of the data, in this case c(1, 10)
prop is lattice.getOption("axis.padding")$numeric, which by default is 0.07
d is diff(as.numeric(lim)), in this case 9
The result in this case is c(0.37, 10.63)
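The arithmetic above can be reproduced in a few lines of base R. This is only a sketch of the formula, not the lattice source: the padding value 0.07 is hard-coded rather than read from lattice.getOption("axis.padding"), and expanded is an illustrative name.

```r
lim  <- c(1, 10)   # range of the data
prop <- 0.07       # default numeric axis padding, hard-coded for this sketch
d    <- diff(as.numeric(lim))

# Pad both ends of the data range by prop * d.
expanded <- lim + prop * d * c(-1, 1)
expanded
```

This gives the same limits as quoted above, c(0.37, 10.63).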
In case you're interested, the call stack from xyplot to extend.limits is
xyplot
xyplot.formula
limits.and.aspect
limitsFromLimitList
extend.limits

Get xlim from a plot in R

I want a histogram and a density on the same plot. I'm trying this:
myPlot <- plot(density(m[,1]), main="", xlab="", ylab="")
par(new=TRUE)
Oldxlim <- myPlot$xlim
Oldylim <- myPlot$ylim
hist(m[,3],xlim=Oldxlim,ylim=Oldylim,prob=TRUE)
but I can't access myPlot's xlim and ylim.
Is there a way to get them from myPlot? What else should I do instead?
Using par(new=TRUE) is rarely, if ever, the best solution. Many plotting functions have an option like add=TRUE that will add to the existing plot (including the plotting function for histograms as mentioned in the comments).
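As a sketch of that idea, the histogram can be drawn first and the density added with lines(), so both share one coordinate system. The vector m1 is a hypothetical stand-in for m[,1], and the pdf(NULL) call is only there so the example runs without a display.

```r
pdf(NULL)  # null device; omit this line (and dev.off()) in an interactive session

set.seed(1)
m1 <- rnorm(200)                 # hypothetical stand-in for m[, 1]

d <- density(m1)
h <- hist(m1, plot = FALSE)      # compute the bins without drawing

# Choose limits that fit both the bars and the curve, then draw both.
hist(m1, freq = FALSE,
     xlim = range(h$breaks, d$x),
     ylim = c(0, max(h$density, d$y)),
     main = "", xlab = "")
lines(d)

invisible(dev.off())
```

Because the curve is added to the existing plot, there is no second set of axes to keep in sync.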
If you really need to do it this way then look at the usr argument to the par function, doing mylims <- par("usr") will give the x and y limits of the existing plot in user coordinates. However when you use that information on a new plot make sure to set xaxs='i' or the actual coordinates used in the new plot will be extended by 4% beyond what you specify.
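The effect of xaxs='i' (and yaxs='i') can be seen with a small base-graphics sketch. The variable names are illustrative, and pdf(NULL) is only used so the code runs without a display.

```r
pdf(NULL)  # null device; omit this line (and dev.off()) in an interactive session

plot(1:10, 1:10)
usr <- par("usr")   # c(x1, x2, y1, y2) of the first plot

# Reusing the limits with the default axis style pads them by another 4%.
plot(1:10, 1:10, xlim = usr[1:2], ylim = usr[3:4])
usr_padded <- par("usr")

# With xaxs = "i" and yaxs = "i" the supplied limits are used exactly.
plot(1:10, 1:10, xlim = usr[1:2], ylim = usr[3:4], xaxs = "i", yaxs = "i")
usr_exact <- par("usr")

invisible(dev.off())

rbind(usr, usr_padded, usr_exact)
```

Only the last plot reproduces the coordinate system of the first one.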
The functions grconvertX and grconvertY are also useful to know. They could be used for this purpose, although they are probably overkill compared to par("usr"). They are more useful for finding the limits in other coordinate systems, or for finding values like the middle of the plotting region in user coordinates.
Have you considered specifying your own xlim and ylim in the first plot (setting them to appropriate values) then just using those values again to set the limits on the histogram in the second plot?
Just by plotting density on its own you should be able to work out sensible values for the minimum and maximum values for both axes then replace xmin, xmax, ymin and ymax for those values in the code below.
Something like:
myPlot <- plot(density(m[,1]), main="", xlab="", ylab="", xlim=c(xmin, xmax), ylim=c(ymin, ymax))
par(new=TRUE)
hist(m[,3], xlim=c(xmin, xmax), ylim=c(ymin, ymax), prob=TRUE)
If for any reason you are not able to use range() to get the limits, I'd follow @Greg's suggestion. This only works if the par parameters "xaxs" and "yaxs" are set to "r" (which is the default), so that the coordinate range is extended by 4% on each side:
plot(seq(0.8,9.8,1), 10:19)
usr <- par('usr')
xr <- (usr[2] - usr[1]) / 27 # 27 = (100 + 2*4) / 4
yr <- (usr[4] - usr[3]) / 27
xlim <- c(usr[1] + xr, usr[2] - xr)
ylim <- c(usr[3] + yr, usr[4] - yr)
I think the best solution is to fix them when you plot your density.
Otherwise, borrow the range computation from the code of plot.default (plot.R), where x and y are the data you are about to plot:
xlab=""
ylab=""
log =""
xy <- xy.coords(x, y, xlab, ylab, log)
xlim1 <- range(xy$x[is.finite(xy$x)])
ylim1 <- range(xy$y[is.finite(xy$y)])
or to use the code above to generate xlim and ylim then call your plot for density
dd <- density(c(-20,rep(0,98),20))
plot(dd,xlim=xlim1,ylim=ylim1)
x <- rchisq(100, df = 4)
hist(x,xlim=xlim1,ylim=ylim1,prob=TRUE,add=TRUE)
Why not use ggplot2?
library(ggplot2)
set.seed(42)
df <- data.frame(x = rnorm(500,mean=10,sd=5),y = rlnorm(500,sdlog=1.1))
p1 <- ggplot(df) +
geom_histogram(aes(x=y,y = after_stat(density)),binwidth=2) +
geom_density(aes(x=x),fill="green",alpha=0.3)
print(p1)
