ggplot2 stat density only for y values - r

I'm wondering whether I can manipulate stat_density2d to show the density for the x values without considering the y values.
To illustrate:
df <- data.frame(x = c(1:40, rep(1:20, 3), 15:40))
ggplot(df, aes(x=x, y = x)) +
stat_density2d(aes(fill='red',alpha=..level..),geom='polygon', show.legend = F) +
geom_point(alpha = 0.3)
Obviously it doesn't really make sense to plot the same values against each other; however, I'm interested in the density of the points at a certain value. Therefore I would like to keep y constant (e.g. y = 1) but still show the same density, like so:
(In my publication I actually have multiple groups, making this a nice way to plot the group separation even though it is 1D)
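No answer to this question appears in the thread. One possible approach, sketched here rather than taken from the original post, is to let stat_density estimate the density along x only and draw it as a strip of coloured cells at a constant y. The choice of y = 1, the raster geom, and the fill mapping are assumptions made for illustration.

```r
library(ggplot2)

df <- data.frame(x = c(1:40, rep(1:20, 3), 15:40))

# Keep y constant and let stat_density() estimate the density along x only.
# geom = "raster" draws one cell per evaluation point, and
# position = "identity" overrides the default stacking of stat_density().
p <- ggplot(df, aes(x = x, y = 1)) +
  stat_density(
    aes(fill = after_stat(density)),
    geom = "raster",
    position = "identity",
    orientation = "x"  # state explicitly that the density runs along x
  ) +
  geom_point(alpha = 0.3)

p
```

With several groups, mapping y to a grouping column (a hypothetical one, since the example data has none) instead of a constant should give one strip per group.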

Related

Why does specifying fill in an Aesthetic mapping change the figure in the plot

When trying to highlight a part of a plot, I got an output I didn't expect.
This is the code I'm using to plot the density function of student grades from my dataset.
grades <- student_data$G3
q_aprox = function(x) return (qnorm(x, mean(grades), sd(grades)))
ggplot(student_data, aes(x = G3)) +
# -- IMPORTANT PART BEGIN -- #
geom_density(
color = 'steelblue',
alpha = 0.3,
position = 'stack'
) +
geom_density(
aes(fill = q_aprox(0.025) < G3 & G3 < q_aprox(0.975)),
alpha = 0.3,
position = 'stack'
) + theme_minimal()
# -- IMPORTANT PART END -- #
Unexpectedly, the plot I got from the first geom_density is different than the one I got from the second geom_density. I expected that, since the x and y mappings are left untouched, the plots would be the same.
Why doesn't this happen?
grades, or student_data$G3, is a numeric vector of size 395 with discrete values from 0 up to 20.
Here's the plot that's produced from the previous code
Output Plot - Not enough reputation to post images, sorry
The left tail on the second call is bigger than the one on the first. Also, the output in general seems to be "more spiked".
I recently watched part 1 of ggplot2's workshop on YouTube in preparation for this college assignment. That's more or less my knowledge level regarding ggplot2.
Specifying the fill aesthetic with a discrete variable triggers heuristics in ggplot2 that automatically group your data. The densities are calculated per group, and each density integrates to 1. Therefore, if you calculate densities for two groups of unequal sizes, each still integrates to 1, so the areas of the densities do not reflect the unequal group sizes.
Below is an example of two groups, wherein group A is 10x as large as group B and the groups have different means. You'll notice that if we don't group the data, the resulting density peaks at -1: the center/mean of group A. However, when we auto-group the data with the fill aesthetic, both densities will peak at their own means, but the area of group B is as large as that of group A (it continues behind the blue/green density).
library(ggplot2)
library(patchwork)
df <- data.frame(
x = c(rnorm(1000, -1), rnorm(100, 1)),
group = rep(c("A", "B"), c(1000, 100))
)
g1 <- ggplot(df, aes(x)) +
geom_density()
g2 <- ggplot(df, aes(x, fill = group)) +
geom_density()
g1 | g2
If you want to retain proportions to the group sizes, you can use y = after_stat(count) to use the computed variable count, which is the density estimate (which integrates to 1) times the number of observations. You can read about computed variables in the documentation under the header "computed variables" in for example ?geom_density.
ggplot(df, aes(x, fill = group)) +
geom_density(aes(y = after_stat(count)))
Created on 2021-05-12 by the reprex package (v0.3.0)
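The claim that a density integrates to 1 regardless of group size can be checked numerically. The following is a minimal sketch using base R's density() rather than ggplot2 internals; the seed, the sample size, and the trapezoid-rule integration are assumptions chosen for illustration.

```r
set.seed(42)
x <- rnorm(100, mean = 1)  # a small group, similar to group B above

d <- density(x)            # kernel density estimate on a regular grid

# Trapezoid rule: area under the density curve
area_density <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Scaling by the number of observations mirrors what after_stat(count) plots
area_count <- area_density * length(x)

area_density  # close to 1
area_count    # close to 100
```

The same calculation for a group of 1000 observations would again give an area close to 1 for the density, which is why the two filled curves look equally large unless count is used.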

control x axis of a violin plot in ggplot2

I'm generating violin plots in ggplot2 for a time series, year_1 to year_32. The years in my df are stored as numerical values. From the examples I've seen, it seems that I must convert these numerical year values to factors to plot one violin per year; and in fact, if I run the code without as.factor, I get one big fat violin. I would like to understand why geom_violin can't have numeric values on the x axis; or if I'm wrong about that, how to use them?
So:
my_data$year <- as.factor(my_data$year)
p <- ggplot(data = my_data, aes(x = year, y = continuous_var)+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label")
p +my_theme()
works fine, but if I skip
my_data$year <- as.factor(my_data$year)
it doesn't work, I get one big fat violin for all years. Why?
TIA
You are missing a ) at the end of this line: p <- ggplot(data = my_data, aes(x = year, y = continuous_var)
I have constructed a reproducible example with the ToothGrowth dataset. This should work now:
library(ggplot2)
my_data <- ToothGrowth
my_data$dose <- as.factor(my_data$dose)
p <- ggplot(data = my_data, aes(x = dose, y = len))+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label") +
theme_bw()
p
PS: this discussion would better fit Cross Validated, as it's more of a statistics than a coding question.
I'm not 100% sure, but here's my explanation: the violin plot shows the density for a set of data, and you can divide your data into groups so that you can plot one violin for each part of your data. But if the variable you're using to divide groups (x axis) is continuous, you're going to have infinitely many possible groupings (one group for the values at 0, one for 0.1, one for 0.01, etc.), so in the end you actually can't divide your data, and ggplot probably ignores the x variable and makes one violin for all your data.
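A related point the answers above do not spell out: ggplot2 only splits data into groups automatically for discrete variables, so a numeric x leaves everything in a single group. As a sketch, and assuming you want to keep the x axis numeric, the group aesthetic can be set explicitly. The example reuses ToothGrowth, where dose is stored as a number.

```r
library(ggplot2)

my_data <- ToothGrowth  # dose is numeric here: 0.5, 1, 2

# group = dose tells ggplot2 to compute one violin per distinct dose
# while x stays on a continuous scale.
p <- ggplot(my_data, aes(x = dose, y = len, group = dose)) +
  geom_violin(fill = "#FF0000", color = "#000000") +
  labs(x = "x_label", y = "y_label")

p
```

The violins are then placed at their true numeric positions, so unevenly spaced values (such as 0.5, 1, 2) stay unevenly spaced on the axis, unlike with a factor.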

move ggplot2 contour from other facets to main

I have x,y,z data with categorical variables that facilitate a facet. I want to include contour lines from all but the first facet and discard the rest of the data. One way to visualize the process is to facet the data and mentally move the contours from the other facets to the first.
MWE:
library(ggplot2)
library(dplyr)
data(volcano)
nx <- 87; ny <- 61
# tibble() replaces data_frame(), which is deprecated in current dplyr/tibble
vdat <- tibble(w=0L, x=rep(seq_len(nx), ny), y=rep(seq_len(ny), each=nx), z=c(volcano))
vdat <- bind_rows(vdat,
mutate(vdat, w=1L, x=x+4, y=y+4, z=z-20),
mutate(vdat, w=2L, x=x+8, y=y+8, z=z-40))
ggplot(vdat, aes(x, y, fill=z)) +
geom_tile() +
facet_wrap(~ w, nrow=1) +
geom_contour(aes(z=z), color='white', breaks=c(-Inf,110,Inf))
In each facet, I have:
facet 0: X,Y,Z for w==0L, contour for w==0L
facet 1: X,Y,Z for w==1L, contour for w==1L
facet 2: X,Y,Z for w==2L, contour for w==2L
What I'd like to have is a single pane, effectively:
X,Y,Z for w==0L, contour for all values of the w categorical
(Forgive my hasty GIMP skills. In the real data, the contours will likely not overlap, but I don't think that that would be a problem.)
The real data has different values (and gradients) of z for the same X,Y system, so the contour is otherwise compatible with the first facet. However, it's still "different", so I cannot mock-up the contours with the single w==0L data.
I imagine there might be a few ways to do this:
form the data "right" the first time, informing ggplot how to pull the contours but lay them on the single plot (e.g., using different data= for certain layers);
form the faceted plot, extract the contours from the other facets, apply them to the first, and discard the other facets (perhaps using grid and/or gtable); or perhaps
(mathematically calculate the contours myself and add them as independent lines; I was hoping to re-use ggplot2's efforts to avoid this ...).
It doesn't fit so neatly with the grammar of graphics, but you can just add a geom_contour call for each subset of data. A quick way is to add a list of such calls to the graph, which you can generate quickly by lapplying across the split data:
ggplot(vdat[vdat$w == 0, ], aes(x, y, z = z, fill = z)) +
geom_tile() +
lapply(split(vdat, vdat$w), function(dat){
geom_contour(data = dat, color = 'white', breaks = c(-Inf, 110, Inf))
})
You can even make a legend, if you need:
ggplot(vdat[vdat$w == 0, ], aes(x, y, z = z, fill = z, color = factor(w))) +
geom_raster() +
lapply(split(vdat, vdat$w), function(dat){
geom_contour(data = dat, breaks = c(-Inf, 110, Inf))
})

Contour plot or heatmap from three continuous variables

I have a model which has told me there is an interaction between two variables, a and b, which is significantly influencing my response variable, c. All three are continuous numeric variables. For detail, c is the rate of change in my response variable, b is the rate of change in my predictor and a is mean annual rainfall. The unit of analysis is pixels in a raster. So my model is telling me mean annual rainfall modifies how my predictor affects my response.
To visualise this interaction I would like to use a contour plot/heat map/level plot with a and b on the x and y axes and c providing the colour to show me how my response variable changes within the space described by a and b. I can do this with a scatter plot but it's not very pretty or easy to interpret:
qplot(b, a, colour = c) +
scale_colour_gradient(low="green", high="red")
When I try to plot a contour plot/heat map/level plot though all I get is errors, blank plots or ugly plots.
geom_contour gives me an error:
ggplot(data = Mod, aes(x = Rain, y = Bomas, z = Fire)) +
geom_contour()
Warning message:
Not possible to generate contour data
geom_raster initially gives me Error: cannot allocate vector of size 81567.2 Gb but when I round my data it produces:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_raster(aes(fill = c))
Adding interpolate = TRUE to the geom_raster code just makes the lines a little blurry.
geom_tile produces a blank graph but with a scale bar for c:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_tile(aes(color = c))
I've also tried using stat_density2d and setting the fill and/or the colour to c, but just got an error, and I've tried using levelplot in the lattice package as well but that produces this:
levelplot(c ~ a * b, data = df,
aspect = "asp", contour = TRUE,
xlab = "a",
ylab = "b")
I suspect the problems I'm encountering arise because the functions are not set up to deal with continuous x and y variables; all the examples seem to use factors. I would have thought I could compensate for that by changing bin widths, but that doesn't seem to work either. Is there a function that allows you to make a heat map with 3 continuous variables? Or do I need to treat my a and b variables as factors and manually make a dataframe with bins appropriate for my data?
If you want to experiment for yourself then you get similar problems to what I'm having with:
df<- as.data.frame(rnorm(1:1068))
df[,2] <- rnorm(1:1068)
df[,3] <- rnorm(1:1068)
names(df) <- c("a", "b", "c")
You can get automatic bins, and for example calculate the means by using stat_summary_2d:
ggplot(df, aes(a, b, z = c)) +
stat_summary_2d() +
geom_point(shape = 1, col = 'white') +
viridis::scale_fill_viridis()
Another good option is to slice your data by the third variable, and plot small multiples. This doesn't really show very well for random data though:
library(ggplot2)
ggplot(df, aes(a, b)) +
geom_point() +
facet_wrap(~cut_number(c, 4))
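Since the question also asks about controlling bin widths, the stat_summary_2d approach above can be tuned through its bins and fun arguments. This is a minimal sketch with simulated data; the values bins = 10 and fun = "median" are arbitrary choices for illustration, not recommendations from the original answer.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(a = rnorm(1068), b = rnorm(1068), c = rnorm(1068))

# bins sets the number of bins in each direction;
# fun sets the summary applied to z (here c) within each bin.
p <- ggplot(df, aes(a, b, z = c)) +
  stat_summary_2d(fun = "median", bins = 10)

p
```

Fewer, larger bins give a smoother but coarser picture; binwidth can be used instead of bins when the bin size should be fixed in data units.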

Interpolating correctly between points in R using ggplot2 and axis scaling

I have some data I want to graph on a semi-log scale, however I get some artifacts when there is a large jump between points. On linear scale, a straight line is drawn between subsequent points, which is a fine approximation for visualization. However, the exact same thing is done when using the log scale (either by using scale_x_log10 or scale_x_continuous with a log transformation). A line between two points on the semi-log scale should show up curved. In other words, this:
df <- data.frame(x = c(0, 1), y = c(0, 1))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
produces this:
when I would expect something more like this:
generated by this code:
df <- data.frame(x = seq(0, 1, 0.01), y = seq(0, 1, 0.01))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
It's clear what's happening, but I'm not sure what the best way to fix the interpolation is. In the actual data I'm plotting there are a few jumps at various points, which makes the plots very misleading when trying to compare two lines. (They're ROC curves in this instance.)
One thought is I can search the data for jumps and fill in some interpolated points myself, but I'm hoping for a cleaner way that doesn't involve me adding in a bunch of fake data points.
What you describe is a transformation of the coordinate system, not a transformation of the scales. The distinction is that scale transformations take place before any statistical transformations, and coordinate transformations take place afterward. In this case, the "statistical transformation" is "draw a straight line between the points". With a transformed scale, the line is straight in the transformed (log) space; with a transformed coordinate, it is straight in the original (linear) space and therefore curved in log space.
# don't include 0 in the data because log 0 is -Inf
DF <- data.frame(x = c(0.1, 1), y = c(0.1, 1))
ggplot(data = DF, aes(x = x, y = y)) +
geom_line() +
coord_trans(x="log10")
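For comparison, the manual alternative mentioned in the question (filling in interpolated points) can be sketched with base R's approx(), which interpolates linearly in the original space. The number of points, n = 200, is an arbitrary assumption; with a transformed scale the densified line then appears curved.

```r
library(ggplot2)

# don't include 0 in the data because log 0 is -Inf
DF <- data.frame(x = c(0.1, 1), y = c(0.1, 1))

# Linear interpolation in the original (untransformed) space
dense <- as.data.frame(approx(DF$x, DF$y, n = 200))

# A scale transformation now draws short straight segments between
# many points, which together approximate the curve.
p <- ggplot(dense, aes(x, y)) +
  geom_line() +
  scale_x_log10()

p
```

This adds the interpolated points the asker hoped to avoid, so the coordinate transformation shown above is usually the cleaner option; the sketch only illustrates that both routes lead to the same shape.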
