R: Weighted Joyplot/Ridgeplot/Density Plot? - r

I am trying to create a joyplot using the ggridges package (based on ggplot2). The general idea is that a joyplot creates nicely scaled stacked density plots. However, I cannot seem to produce one of these using weighted density. Is there some way of incorporating sampling weights (for weighted density) in the calculation of the densities in the creation of a joyplot?
Here's a link to the documentation for the ggridges package: https://cran.r-project.org/web/packages/ggridges/ggridges.pdf I know a lot of packages based on ggplot can accept additional aesthetics, but I don't know how to add weights to this type of geom object.
Additionally, here is an example of an unweighted joyplot in ggplot. I am trying to convert this to a weighted plot with the density weighted according to pweight.
# Load package, set seed
library(ggplot2)
library(ggridges)
set.seed(1)
# Create an example dataset
dat <- data.frame(group = c(rep("A",100), rep("B",100)),
pweight = runif(200),
val = runif(200))
# Create an example of an unweighted joyplot
ggplot(dat, aes(x = val, y = group)) + geom_density_ridges(scale= 0.95)

It looks like the way to do this is to use stat_density rather than the default stat_density_ridges. Per the docs you linked to:
Note that the default stat_density_ridges makes joint density
estimation across all datasets. This may not generate the desired
result when using faceted plots. As an alternative, you can set
stat = "density" to use stat_density. In this case, it is required
to add the aesthetic mapping height = ..density.. (see examples).
Fortunately, stat_density (unlike stat_density_ridges) understands the aesthetic weight and will pass it to the underlying density call. You end up with something like:
ggplot(dat, aes(x = val, y = group)) +
  geom_density_ridges(aes(height = ..density..,  # Notice the additional
                          weight = pweight),     # aes mappings
                      scale = 0.95,
                      stat = "density")          # and use of stat_density
The ..density.. variable is automatically generated by stat_density.
Note: It appears that when you use stat_density the x-axis range behaves a little differently: it will trim the density plot to the data range and drop the nice-looking tails. You can easily correct this by manually expanding your x-axis, but I thought it was worth mentioning.
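For intuition about what a weighted density amounts to, here is a minimal base-R sketch that computes one per group with density(). It assumes the weights are normalised to sum to 1 within each group and that the bandwidth is chosen from the unweighted values; weighted_density is a hypothetical helper name, not part of ggridges or ggplot2.

```r
set.seed(1)
dat <- data.frame(group = c(rep("A", 100), rep("B", 100)),
                  pweight = runif(200),
                  val = runif(200))

# Hypothetical helper: weighted kernel density for one group
weighted_density <- function(d) {
  w  <- d$pweight / sum(d$pweight)  # assumption: weights normalised to sum to 1
  bw <- bw.nrd0(d$val)              # bandwidth from the unweighted values
  dens <- density(d$val, weights = w, bw = bw)
  data.frame(group = d$group[1], x = dens$x, y = dens$y)
}

# One density curve per group, stacked into a single data frame
dens_by_group <- do.call(rbind, lapply(split(dat, dat$group), weighted_density))
head(dens_by_group)
```

The resulting x/y columns could be plotted directly if you prefer to pre-compute the curves rather than rely on the stat.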

Related

Plotting Chi-square Distribution with ggplot2 in R

I would like to use R to randomly construct chi-square distribution with the degree of freedom of 5 with 100 observations. After doing so, I want to calculate the mean of those observations and use ggplot2 to plot the chi-square distribution with a bar chart. The following is my code:
rm(list = ls())
library(ggplot2)
set.seed(9487)
###Step_1###
x_100 <-data.frame(rchisq(100, 5, ncp = FALSE))
###Step_2###
mean_x <- mean(x_100[,1])
class(x_100)
###Step_3###
plot_x_100 <- ggplot(data = x_100, aes(x = x_100)) +
geom_bar()
plot_x_100
Firstly, I construct a data frame of a random chi-square distribution with df = 5, obs = 100.
Secondly, I calculate the mean value of this chi-square distribution.
At last, I plot the graph with the ggplot2 package.
However, I get the following result:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
I have been stuck on this problem for several hours and cannot find any list in my global environment. I would appreciate any help or suggestions.
The problem is that inside ggplot() you are passing the same data frame (x_100) both as the data and as the x variable inside aes. In ggplot, aes should name the column you wish to map, not the data frame itself. Additionally, if you want to plot the chi-square distribution, geom_histogram is probably a better choice than geom_bar, because geom_histogram groups the observations into bins.
library(ggplot2)
# Rename the only column of your data frame as "value"
colnames(x_100) <- "value"
plot_x_100 <- ggplot(data = x_100, aes(x = value)) +
geom_histogram(bins = 20)
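A variant of the same fix, sketched under the assumption that you are free to name the column when the data frame is created. It also marks the sample mean from Step 2 with geom_vline. The column name value, the 20 bins, and the dashed line are illustrative choices.

```r
library(ggplot2)
set.seed(9487)

# Name the column up front so aes() can refer to it directly
x_100 <- data.frame(value = rchisq(100, df = 5))
mean_x <- mean(x_100$value)

plot_x_100 <- ggplot(x_100, aes(x = value)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = mean_x, linetype = "dashed")  # sample mean
plot_x_100
```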

Why scatter plots in ggpairs function don't have the loess layer on them?

I have a quick question, and can't figure out what the problem is. I wanted to plot a dataset I have, and found one solution here:
How to use loess method in GGally::ggpairs using wrap function
However, I can't seem to figure out what was wrong with my approach. Here is the code chunk below with simple mtcars dataset:
library(ggplot2)
library(GGally)
View(mtcars)
GGally::ggpairs(mtcars,
lower= list(
ggplot(mapping = aes(rownames(mtcars))) +
geom_point()+
geom_smooth(method = "loess"))
)
Here, as you can see, is my output that doesn't put the smooth layer on the scatter plot. I wanted to have it for the regression analysis for my actual dataset. Any direction or explanation would be good. Thank you!
The solution in the post from @Edward's comment works here with mtcars. The snippet below replicates your plot above, with a loess line added:
library(ggplot2)
library(GGally)
View(mtcars)
# make a function to plot generic data with points and a loess line
my_fn <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
p
}
# call ggpairs, using mtcars as data, and plotting continuous variables using my_fn
ggpairs(mtcars, lower = list(continuous = my_fn))
In your snippet, the second argument lower has a ggplot object passed to it, but what it requires is a list with specifically named elements, that specify what to do with specific variable types. The elements in the list can be functions or character vectors (but not ggplot objects). From the ggpairs documentation:
upper and lower are lists that may contain the variables 'continuous',
'combo', 'discrete', and 'na'. Each element of the list may be a
function or a string. If a string is supplied, it must implement one
of the following options:
continuous exactly one of ('points', 'smooth', 'smooth_loess',
'density', 'cor', 'blank'). This option is used for continuous X and Y
data.
combo exactly one of ('box', 'box_no_facet', 'dot', 'dot_no_facet',
'facethist', 'facetdensity', 'denstrip', 'blank'). This option is used
for either continuous X and categorical Y data or categorical X and
continuous Y data.
discrete exactly one of ('facetbar', 'ratio', 'blank'). This option is
used for categorical X and Y data.
na exactly one of ('na', 'blank'). This option is used when all X data
is NA, all Y data is NA, or either all X or Y data is NA.
The reason my snippet works is because I've passed a list to lower, with an element named 'continuous' that is my_fn (which generates a ggplot).
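Since the documentation quoted above lists 'smooth_loess' among the string options for continuous, a shorter route may be to pass that string instead of a custom function. This is only a sketch; the three columns are picked just to keep the plot matrix small.

```r
library(GGally)

# Use the built-in string option rather than a user-defined function
p <- ggpairs(mtcars[, c("mpg", "disp", "hp")],
             lower = list(continuous = "smooth_loess"))
p
```

A custom function such as my_fn is still the more flexible choice when you need to pass extra arguments to geom_smooth.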

ggplot2 vs sm package density plot output (and statistical analysis)

Consider the following data frame example
library('ggplot2')
library('sm')
original<-c(1:100,1)
a<-sample(original,100)
b<-rep(1:4,25)
lala<-data.frame(a,b)
My aim is to produce density plots for values in lala$a, according to each group (1,2,3,4) defined in lala$b.
For doing so in ggplot2 I could do the following
plotDensityggplot<-ggplot()+
geom_density(data = lala, aes(a, colour=factor(b)))+
theme_classic()
print(plotDensityggplot)
producing this:
However, when I plot the same data using the 'sm' package to make a formal comparison of the densities using the following code:
sm.density.compare(lala$a,as.numeric(lala$b),model = "equal")
The density curves extend below zero on the x-axis, even though there is no value below zero in lala$a.
What's going on? Note that this affects the densities reported on the y-axis.
Is the p-value from the permutation test of equality obtained from sm.density.compare a reliable estimate? Thank you!
For what it's worth, you can (more or less) reproduce the sm output in ggplot by pre-computing densities with base R's density (I'm not familiar with sm but I imagine that sm.density calls base R's density at some point as well).
library(tidyverse)
lala %>%
group_by(b) %>%
summarise(tmp = list(map_dfc(c("x", "y"), ~density(a)[.x]))) %>%
unnest(tmp) %>%
ggplot(aes(x, y, colour = as.factor(b))) +
geom_line()
I'm not sure how geom_density (or stat_density) tune kernel density estimation parameters, but you seem to have less control over them than in base R's density.
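One likely source of the difference can be sketched in base R, assuming sm behaves like density() in this respect: by default density() evaluates the curve out to cut = 3 bandwidths beyond the data, which is why the curve can extend below zero, whereas stat_density, as far as I can tell, evaluates only over the range of the x scale. The data below are illustrative.

```r
set.seed(42)
a <- sample(1:100, 50)

d <- density(a)   # default: cut = 3
min(a)            # smallest observed value
min(d$x)          # grid starts at min(a) - 3 * bandwidth

# Restricting the grid to the data range mimics a curve without tails
d_trim <- density(a, from = min(a), to = max(a))
range(d_trim$x)
```

If you need to match a specific bandwidth, geom_density also accepts bw, adjust and kernel arguments.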

How to interpret the different ggplot2 densities?

I am confused about the meaning of the following variants of geom_density in ggplot:
Can someone please explain the difference between these four calls:
geom_density(aes_string(x=myvar))
geom_density(aes_string(x=myvar, y=..density..))
geom_density(aes_string(x=myvar, y=..scaled..))
geom_density(aes_string(x=myvar, y=..count../sum(..count..)))
My understanding is that:
geom_density alone will produce a density whose area under the curve sums to 1
geom_density with ..density.. basically does the same... ?
the ..count../sum(..count..) will normalize the peak heights to be more like a normalized histogram, ensuring that all the heights sum to 1
the ..count.. by itself without the denominator will just multiply each bin by # of items in it
the ..scaled.. parameter will make it so the maximum value of the density is 1.
I find ..scaled.. very counterintuitive and have never seen it used if my interpretation of it is correct so I'd like to ignore that. I am mainly looking for a clarification of the differences between geom_density and a kind of normalized density plot, which I am assuming requires the ...count../... argument. thanks.
(Related: Error with ggplot2 mapping variable to y and using stat="bin")
The default aesthetic for stat_density is ..density.., so a call to geom_density which uses stat_density by default, will plot y = ..density.. by default.
You can see how the various columns are calculated by looking at the source code.
..scaled.. is defined as
densdf$scaled <- densdf$y / max(densdf$y, na.rm = TRUE)
Feel free to ignore it if you wish.
Looking at the source code for stat_bin
The results are computed as such
res <- within(results, {
count[is.na(count)] <- 0
density <- count / width / sum(abs(count), na.rm=TRUE)
ncount <- count / max(abs(count), na.rm=TRUE)
ndensity <- density / max(abs(density), na.rm=TRUE)
})
So if you want to compare the results of geom_histogram (using the default stat = 'bin'), then you can set y = ..density.. and it will calculate count / sum(count) for you (accounting for the width of the bins)
If you wanted to compare geom_density(aes(y=..scaled..)) with stat_bin, then you would use geom_histogram(aes(y = ..ndensity..))
You could get them on the same scale by using ..count.. in both as well, however you would need to adjust the adjust parameter in stat_density to get the appropriately detailed approximation of the curve.
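The stat_bin formulas above can be checked with a small base-R sketch that uses hist() for the binning. This is an illustration on made-up data, not ggplot2's own code path, and the breaks chosen by hist() will generally differ from those of geom_histogram.

```r
set.seed(1)
x <- rnorm(200)

h <- hist(x, plot = FALSE)
count <- h$counts
width <- diff(h$breaks)

# Same arithmetic as in the stat_bin source shown above
dens     <- count / width / sum(abs(count))
ncount   <- count / max(abs(count))
ndensity <- dens / max(abs(dens))

sum(dens * width)  # total bar area
max(ncount)        # tallest bar after scaling counts
max(ndensity)      # tallest bar after scaling densities
```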

Error with ggplot2 mapping variable to y and using stat="bin"

I am using ggplot2 to make a histogram:
geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")
and I get the error:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x and would like the height of the histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data.)
edit: if I want to make a density plot geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.
The confusion here is a long standing one (as evidenced by the verbose warning message) that all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data) by default.
But let's back up. geom_*'s control the actual rendering of data into some sort of geometric form. stat_*'s simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar and so it can seem indistinguishable from geom_bar when you're learning.
In any case, consider the "bar"-like geom's: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms, now stat_count for bar charts) on your x values.
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be ok and should not throw that warning (it doesn't for me using @mnel's example above).
The take-away here is that for geom_bar if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
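The two cases can be sketched with the small example tables above. The object names raw and binned are illustrative; layer_data() is used only to inspect what each layer ends up plotting.

```r
library(ggplot2)

raw    <- data.frame(x = c("a", "a", "a", "b", "b", "b"))  # not yet binned
binned <- data.frame(x = c("a", "b"), y = c(3, 3))         # pre-binned

# Unbinned data: let the stat count the cases, do not map y
p_raw <- ggplot(raw, aes(x)) + geom_bar()

# Pre-binned data: map y yourself and use geom_col (stat = "identity")
p_binned <- ggplot(binned, aes(x, y)) + geom_col()

layer_data(p_raw)$count  # counts computed by the stat
layer_data(p_binned)$y   # heights taken straight from the data
```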
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies here as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex) since there hasn't been as much flexibility available in the analogous z variable to the binned y variable in the 1d case. If future updates start allowing more fancy manipulations of the 2d binning cases this could I suppose become something you have to watch out for there.
The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar.
The documentation for geom_density states that it uses a smooth density estimate produced using stat_density.
Following the links (or finding the help pages directly)
stat_bin
The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns
count number of points in bin
density density of points in bin, scaled to integrate to 1
ncount count, scaled to maximum of 1
ndensity density, scaled to maximum of 1
stat_density
The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns
density density estimate
count density * number of points - useful for stacked density plots
scaled density estimate, scaled to maximum of 1
To produce a plot on the same scale, it would appear that you want ..ndensity.. from stat_bin and ..scaled.. from stat_density, or ..density.. from both:
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..density..)) +
geom_density(aes(y=..density..))
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..ndensity..)) +
geom_density(aes(y=..scaled..))
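A self-contained version of the second plot, as a sketch: dd is hypothetical data (it is not defined in the snippets above), bins = 30 is an arbitrary choice, and after_stat() is the newer spelling of the ..name.. notation, assuming ggplot2 3.3.0 or later.

```r
library(ggplot2)
set.seed(1)
dd <- data.frame(x = rnorm(200))  # hypothetical data

p <- ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = after_stat(ndensity)), bins = 30) +
  geom_density(aes(y = after_stat(scaled)))
p

# Both layers should peak at the same height
max(layer_data(p, 1)$y)
max(layer_data(p, 2)$y)
```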
