Kernel Density Estimate (Probability Density Function) is wrong?

I've created a histogram showing the density of the age at which serial killers first killed, and have tried to superimpose a probability density function on it. However, when I use the geom_density() function in ggplot2, I get a density curve that looks far too small (area < 1). Strangely, changing the bin width of the histogram also seems to change the density curve: the smaller the bin width, the better the curve appears to fit. Does anyone have guidance on how to make the curve fit better, and on why its area is so far below 1?
#Histograms for Age of First Kill:
library(ggplot2)
AFKH <- ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
geom_histogram(aes(y=..count../sum(..count..)), show.legend = FALSE, binwidth = 3) + # density wasn't working, so had to use ..count../sum(..count..)
scale_fill_discrete(h = c(200, 10), c = 100, l = 60) + # c =, for color, and l = for brightness, the #h = c() changes the color gradient
theme(axis.title=element_text(size=22,face="bold"),
plot.title = element_text(size=30, face = "bold"),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14)) +
labs(title = "Age of First kill",x = "Age of First Kill", y = "Density")+
geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7, fill = "white",lwd =1, stat = "density")
AFKH

We don't have your data set, so let's make one that's reasonably close to it:
set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)
The first thing to notice is that the density curve doesn't change. Look carefully at the y axis on your plot: the peak of the density curve stays at about 0.06 regardless of the bin width. It's the height of the histogram bars that changes, and the y axis rescales accordingly.
The reason is that you aren't dividing the height of the histogram bars by their width, so their total area isn't preserved. Your y aesthetic should be ..count../sum(..count..)/binwidth to keep the total area at 1.
To show this, let's wrap your plotting code in a function that allows you to specify the bin width but also takes the binwidth into account when plotting:
draw_it <- function(bw) {
ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
geom_histogram(aes(y=..count../sum(..count..)/bw), show.legend = FALSE,
binwidth = bw) +
scale_fill_discrete(h = c(200, 10), c = 100, l = 60) +
theme(axis.title=element_text(size=22,face="bold"),
plot.title = element_text(size=30, face = "bold"),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14)) +
labs(title = "Age of First kill",x = "Age of First Kill", y = "Density") +
geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7,
fill = "white",lwd =1, stat = "density")
}
And now we can do:
draw_it(bw = 1)
draw_it(bw = 3)
draw_it(bw = 7)
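To check the area argument numerically, here is a minimal sketch in base R that needs no plotting. It reuses the simulated data from above; `bar_area` is a hypothetical helper written for this illustration, and its bins are built with `hist()`, so they will not line up exactly with the bins ggplot2 chooses.

```r
# Simulated data, same as in the answer above
set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)

# Total area of the histogram bars for a given bin width `bw`, when the
# bar height is the proportion (count / sum(count)) versus the density
# (count / sum(count) / bw).
bar_area <- function(x, bw) {
  breaks <- seq(floor(min(x)), ceiling(max(x)) + bw, by = bw)
  h <- hist(x, breaks = breaks, plot = FALSE)
  prop <- h$counts / sum(h$counts)
  c(proportion = sum(prop * bw),        # area = height * width, summed
    density    = sum(prop / bw * bw))
}

areas <- sapply(c(1, 3, 7), function(bw) bar_area(df$AgeFirstKill, bw))
colnames(areas) <- paste0("bw=", c(1, 3, 7))
areas
```

The proportion-scaled bars have a total area equal to the bin width, which is why the density curve (area 1) looks too small next to them, while the density-scaled bars always have area 1.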

Related

How to remove zig-zag pattern in marginal distribution plot of integer values in R?

I am including marginal distribution plots on a scatterplot of a continuous and an integer variable. However, in the marginal distribution plot for the integer variable (y-axis), a zig-zag pattern shows up because the y-values are all integers. Is there any way to increase the "width" (not sure that's the right term) of the bins/values over which the function calculates the density?
The goal is to get rid of that zig-zag pattern that develops because the y-values are integers.
library(GlmSimulatoR)
library(ggplot2)
library(patchwork)
### Create right-skewed dataset that has one continuous variable and one integer variable
set.seed(123)
df1 <- data.frame(matrix(ncol = 2, nrow = 1000))
x <- c("int","cont")
colnames(df1) <- x
df1$int <- round(rgamma(1000, shape = 1, scale = 1),0)
df1$cont <- round(rgamma(1000, shape = 1, scale = 1),1)
p1 <- ggplot(data = df1, aes(x = cont, y = int)) +
geom_point(shape = 21, size = 2, color = "black", fill = "black", stroke = 1, alpha = 0.4) +
xlab("Continuous Value") +
ylab("Integer Value") +
theme_bw() +
theme(panel.grid = element_blank(),
text = element_text(size = 16),
axis.text.x = element_text(size = 16, color = "black"),
axis.text.y = element_text(size = 16, color = "black"))
dens1 <- ggplot(df1, aes(x = cont)) +
geom_density(alpha = 0.4) +
theme_void() +
theme(legend.position = "none")
dens2 <- ggplot(df1, aes(x = int)) +
geom_density(alpha = 0.4) +
theme_void() +
theme(legend.position = "none") +
coord_flip()
dens1 + plot_spacer() + p1 + dens2 +
plot_layout(ncol = 2, nrow = 2, widths = c(6,1), heights = c(1,6))

From ?geom_density:
adjust: A multiplicate [sic] bandwidth adjustment. This makes it possible
to adjust the bandwidth while still using the a bandwidth
estimator. For example, ‘adjust = 1/2’ means use half of the
default bandwidth.
So as a start try e.g. geom_density(..., adjust = 2) (bandwidth twice as wide as default) and go from there.
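As far as I know, geom_density() passes `adjust` on to stats::density(), so the effect can be inspected in base R. The sketch below is only an illustration: the value `adjust = 3` is an arbitrary choice, and `count_peaks` is a hypothetical helper that counts local maxima as a rough measure of how wiggly the curve is.

```r
# Integer-valued data generated the same way as df1$int in the question
set.seed(123)
int_vals <- round(rgamma(1000, shape = 1, scale = 1), 0)

d_default <- density(int_vals)              # default bandwidth
d_wide    <- density(int_vals, adjust = 3)  # three times the default bandwidth

# The bandwidth actually used by each estimate
c(default = d_default$bw, wide = d_wide$bw)

# Rough "zig-zag" measure: number of local maxima of the estimated curve
count_peaks <- function(d) sum(diff(sign(diff(d$y))) == -2)
c(default = count_peaks(d_default), wide = count_peaks(d_wide))
```

Increasing `adjust` until the peak count stops reflecting the individual integers is one way to pick a value before plugging it into geom_density().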

Why is the resolution on my geom_points so poor

I came across an alternative to grouped bar charts in ggplot that Rebecca Barter posted on her blog and wanted to give it a try. It produces a slick Cleveland dot plot:
The code for my attempt follows:
ggplot() +
# remove axes and superfluous grids
theme_classic() +
theme(axis.ticks.y = element_blank(),
text = element_text(family = "Roboto Condensed"),
axis.text = element_text(size = rel(1.5)),
plot.title = element_text(size = 30, color = "#000000"),
plot.subtitle = element_text(size = 15, color = "#Ec111A"),
plot.caption = element_text(size = 15, color = "grey25"),
plot.margin = margin(20,20,20,20),
panel.background = element_rect(fill = "white"),
axis.line = element_blank(),
axis.text.x = element_text(vjust= + 15)) +
# add a dummy point for scaling purposes
geom_point(aes(x = 15, y = P),
size = 0, col = "white") +
# add the horizontal discipline lines
geom_hline(yintercept = 1:9, color = "grey80") +
# add a point for each male success rate
geom_point(aes(x = Male, y = P),
size = 15, col = "#00b0f0") +
# add a point for each female success rate
geom_point(aes(x = Female, y = P),
size = 15, col = "#Ec111A") +
# add the text (%) for each male success rate
geom_text(aes(x = Male, y = P,
label = paste0(round(Male, 1))),
col = "black", fontface = "bold") +
# add the text (%) for each female success rate
geom_text(aes(x = Female, y = P,
label = paste0(round(Female, 1))),
col = "white", fontface = "bold") +
# add a label above the first two points
geom_text(aes(x = x, y = y, label = label, col = label),
data.frame(x = c(21.8 - 0, 24.6 - 0), y = 7.5,
label = c("Male", "Female")), size = 6) +
scale_color_manual(values = c("#Ec111A", "#00b0f0"), guide = "none") +
# manually specify the x-axis
scale_x_continuous(breaks = c(0, 10, 20, 30),
labels = c("0%","10%", "20%", "30%")) +
# manually set the spacing above and below the plot
scale_y_discrete(expand = c(0.15, 0)) +
labs(
x = NULL,
y = NULL,
title= "Move Percentage By Gender",
subtitle = "What Percentage Of Moves Are Tops",
caption = "Takeaway: Males have fewer Tops and more Xs compared to Females.")
But the points in my plot have very jagged edges (poor resolution), and I can't figure out the cause.
Has anyone come across this problem and know how to fix it?

Resolution depends on how you save the plot and on your graphics device. In other words: how are you saving your plot? Since so much depends on your personal setup and parameters, your mileage will vary. One of the more dependable ways to save ggplot2 plots is ggsave(), where you can specify these parameters and keep them consistent. Here is some example plot code:
ggplot(mtcars, aes(disp, mpg)) +
geom_point(size=10, color='red1') +
geom_text(aes(label=cyl), color='white')
This creates a plot similar to what you show using mtcars. If I copy and paste the graphic output directly from R or use export (I'm using RStudio) this is what you get:
Not sure if you can tell, but the edges are jagged and it does not look clean on close inspection. Definitely not OK for me. However, here's the same plot saved using ggsave():
ggsave('myplot.png', width = 9, height = 6)
You should be able to tell that it's a lot cleaner, because it is saved with a higher resolution. File size on the first is 9 KB, whereas it's 62 KB on the second.
In the end - just play with the settings on ggsave() and you should find some resolution that works for you. If you just input ggsave('myplotfile.png'), you'll get the width/height settings that match your viewport window in RStudio. You can get an idea of the aspect and size and adjust accordingly. One more point - be cautious that text does not scale the same as geoms, so your circles will increase in size differently than the text.
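A minimal sketch of setting the size and resolution explicitly rather than relying on the viewport. The file location (a temporary directory) and the 300 dpi value are assumptions made for the example, not settings from the answer above.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point(size = 10, color = "red1") +
  geom_text(aes(label = cyl), color = "white")

# Write to a temporary file so the example does not clutter the working directory
out_file <- file.path(tempdir(), "myplot.png")

# width/height are in inches here; dpi controls how many pixels each inch gets
ggsave(out_file, plot = p, width = 9, height = 6, units = "in", dpi = 300)

# Pixel dimensions implied by these settings
px <- c(width = 9, height = 6) * 300
px
```

Raising `dpi` increases the pixel count without changing the relative size of points and text, whereas changing `width` and `height` changes how large the geoms appear relative to the canvas.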

GGplot with multiple/conditional scale_fill_manual for stat_density_ridges

Here's a question for ggplot experts...
My dataset has 432000 observations of 4 variables (one is numeric, the others are factors). Predictors has 6 levels, Estimate has 4 levels, Model has 2 levels. Value has a max of 2.6 and a min of -3. (I hope you can create data with that information.)
The plot set-up is a 4x6 faceted plot; here is a 2x3 example of the plot:
each row is a different level of a factor (Predictors)
each column a different level of another factor (Estimate)
there are two distributions within each mini-plot which is another factor (Model)
The goal is to plot:
the distributions in each column in a different color (blue, green, red, yellow) (according to Estimate)
within each mini-plot, the shade/hue of that color should be different (e.g., within the green column, repeat the order of colors according to Model)
fill the tails of two quantiles on each distribution of each mini-plot (as the tail lines in the picture indicate; color the tail from each line to the end of the tail in black/gray). The tails can be the same throughout the entire plot.
Here's an example of the code that I'm using. It doesn't plot the quantiles in a separate color:
pp <- ggplot(dd, aes(x=Value, y=as.factor(Model), fill=factor(Model))) +
stat_density_ridges(quantile_lines = TRUE, quantiles = c(0.05, 0.95), alpha = 0.95,vline_size = 0.5)+
scale_fill_manual(values = c("red", "white")) +
geom_vline(xintercept = 0, linetype="dashed", color = "black", size=0.5) +
facet_grid(Predictors~Estimate, scales = "free") + labs(x="Parameter value", y=" ") +
theme(text = element_text(size = 16)) + theme(axis.title=element_text(face="bold"), strip.text = element_text(
size = 16)) + theme(legend.position = "none")
To color the quantiles, you can swap fill=factor(Model) with fill=factor(..quantile..), but getting both "fills" in the same plot has been impossible thus far. Among many other things, I tried entering multiple factors into "fill", like this: fill=c(factor(Model), factor(Estimate), ..quantile..) , but it didn't work.
Any ideas?

I think from your description your data looks a bit like this (though I've limited it to 6000 rows):
set.seed(69)
Value <- rnorm(6000)
Predictors <- factor(rep(LETTERS[1:6], each = 1000))
Estimate <- factor(rep(rep(letters[1:4], each = 250), 6))
Model <- factor(rep(rep(c("Model1", "Model2"), each = 125), 24))
Value <- Value + rep(rnorm(6), each = 1000)
Value <- Value + rep(rep(rnorm(4), each = 250), 6)
Value <- Value + rep(rep(rnorm(2), each = 125), 24)
dd <- data.frame(Value, Predictors, Estimate, Model)
It sounds like most of what you want to do can be achieved by creating a new factor variable that is a conjunction of two other factors:
dd$fill_factor <- as.factor(paste0(dd$Model, dd$Estimate))
Which means that we should get close to the desired effect with minimal changes to your code:
library(ggplot2)
library(ggridges)
my_colors <- c("#0000FF", "#00FF00", "#FF0000", "#FFFF00")
ggplot(dd, aes(x = Value, y = Model, fill = fill_factor)) +
stat_density_ridges(quantile_lines = TRUE,
quantiles = c(0.05, 0.95),
alpha = 0.95,
vline_size = 0.5) +
scale_fill_manual(values = c(gsub("0", "6", my_colors),
gsub("F", "A", my_colors))) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.5) +
facet_grid(Predictors ~ Estimate, scales = "free") +
labs(x = "Parameter value", y = " ") +
theme(text = element_text(size = 16),
axis.title = element_text(face = "bold"),
strip.text = element_text(size = 16),
legend.position = "none")
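The colour assignment above relies on the alphabetical order of the levels of fill_factor. One way to make the mapping explicit is to give scale_fill_manual() a named vector, which it matches to levels by name. The sketch below only builds that vector in base R; `fill_levels` and `fill_values` are illustrative names, and the level names assume the simulated data above ("Model1"/"Model2" and "a" to "d").

```r
model_levels    <- c("Model1", "Model2")
estimate_levels <- letters[1:4]
my_colors <- c("#0000FF", "#00FF00", "#FF0000", "#FFFF00")

# One row per Model/Estimate combination; Estimate varies fastest
combos <- expand.grid(Estimate = estimate_levels,
                      Model = model_levels,
                      stringsAsFactors = FALSE)
fill_levels <- paste0(combos$Model, combos$Estimate)

# Lighter shade for Model1, darker shade for Model2, same hue per Estimate
fill_values <- setNames(
  c(gsub("0", "6", my_colors), gsub("F", "A", my_colors)),
  fill_levels
)
fill_values
```

The result can then be passed as scale_fill_manual(values = fill_values), so the colours stay attached to the right combinations even if the factor levels are reordered.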

ggplot2 2D Density plot - the gradient fill is too smooth

I am having some difficulty with the gradient fill in ggplot2. For my data, which has a low number of data points, the gradient doesn't really match the density of the points. Here is an example:
The code I am using is:
pt <- read.xlsx("plots.xlsx", sheetName = "PT1_TB varseq", stringsAsFactors=FALSE)
ggplot(pt, aes(x = BAF, y = LogR)) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_fill_distiller(palette= "Spectral", direction=-1) +
scale_y_continuous(name="LogR", limits = c(-0.8, 0.6), breaks = seq(-0.8, 0.6, 0.2)) +
scale_x_continuous(name="BAF", breaks = seq(0, 0.8, 0.2)) +
theme(
legend.position='none',
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
) +
geom_point(aes(shape = factor("cyl")), size = 1) + scale_shape(solid = FALSE)
I would like the gradient to change more abruptly. For example, I would like to see more separation in color between the points at (0; 0.2) and (0.25; -0.2). Furthermore, the yellow area in the middle, where there are no points, should be blue.
While I am at it, does anybody know how to remove the white gap between the axes and the actual plot?
Thanks in advance :)

It would help if you could provide a reproducible example. However, to drive home the point in the comment by @RichardTelford, here's an example that uses the manipulate package to interactively set the bandwidth parameter h as well as n, the number of grid points.
library(ggplot2)
library(manipulate)
manipulate(
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point() +
xlim(0.5, 6) +
ylim(40, 110) +
stat_density_2d(geom = "raster", aes(fill = ..density..), contour = FALSE,
h = c(x_bandwidth, y_bandwidth),
n = grid_points) +
scale_fill_distiller(palette = "Spectral", direction = -1),
x_bandwidth = slider(0.1, 20, 1, step = 0.1),
y_bandwidth = slider(0.1, 20, 1, step = 0.1),
grid_points = slider(1, 100, 16)
)
So our plain-vanilla (default) plot looks like this:
We can interactively change the parameters using the pop-up menu accessible from the gear icon:
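For a non-interactive look at the same effect, here is a small sketch that calls MASS::kde2d() directly, which to my knowledge is the estimator stat_density_2d() uses under the hood. The two bandwidth pairs are arbitrary values picked for illustration, not recommendations.

```r
library(MASS)  # ships with standard R installations

x <- faithful$eruptions
y <- faithful$waiting

# Same data and grid size, two different bandwidths (h = c(x, y))
smooth <- kde2d(x, y, h = c(2, 20), n = 50)   # wide bandwidth: very smooth surface
sharp  <- kde2d(x, y, h = c(0.5, 5), n = 50)  # narrow bandwidth: tighter peaks

# Each result holds the grid coordinates (x, y) and a 50 x 50 density matrix z
dim(sharp$z)

# A narrower bandwidth concentrates the density around the points,
# so its highest peak is taller
c(smooth = max(smooth$z), sharp = max(sharp$z))
```

A narrower `h` is what makes the gradient change more abruptly between nearby groups of points, at the cost of a noisier surface.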

Optimizing trimmed K-means for clustering of 2D data with many outliers? Better approach?

I have the following type of data/plot
Just looking at the datapoints alone it's pretty much impossible to judge where the peaks should be, but if drawn with 2D density smoothing in ggplot I get these really nice peaks, where I can visually count ~10 groups of points that I'd like to find. The exact number of "valid groups" is of course up for discussion.
Data here:
https://pastebin.com/5wquw7UF
library(ggplot2)
library(ggthemes)   # provides theme_tufte()
library(colorRamps) # provides matlab.like()
library(tclust)
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom = "raster",
aes(fill = ..density..),
contour = FALSE) +
geom_point(col = "white", alpha = 0.1) +
scale_x_continuous(expand = c(0,0),
limits = c(0,1)) +
scale_y_continuous(expand = c(0,0),
limits = c(0,1)) +
theme_tufte(base_size = 11, base_family = "Helvetica") +
theme(axis.text = element_text(color = "black"),
panel.border = element_rect(colour = "black", fill=NA, size=0.7),
legend.key.height = unit(2.5,"line"),
legend.key.width = unit(1, "line")) +
scale_fill_gradientn(name = "Density",
colours = matlab.like(1000))
I've looked into trimmed clustering, with the package tclust. By fiddling around with the data I've been able to come up with the below. However, no matter how much I fiddle around with the parameters I can't seem to get groups that are as "tight" as I visually feel like I see. Especially group 5 seems to sneak its way into places it doesn't belong. Group 10 is also a bit odd, but isolated enough to discard afterwards.
Is there a way better method for this, or is it simply me not understanding how to set the parameters correctly?
set.seed(2)
trimmed_cluster <- tclust(
x = df,
k = 10, # 9
alpha = 0.1, # 0.1
drop.empty.clust = FALSE,
equal.weights = TRUE,
restr = c("sigma", "eigen"), # sigma
restr.fact = 1
)
df$cluster <- trimmed_cluster$cluster
trimmed_cluster_centers <- data.frame(t(trimmed_cluster$centers))
df_clustered <- subset(df, cluster != 0)
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom = "raster",
aes(fill = ..density..),
contour = FALSE) +
geom_point(data = df_clustered, aes(x = x, y = y, col = as.factor(cluster))) +
geom_text(data = trimmed_cluster_centers,
aes(x = x, y = y, label = as.character(1:length(trimmed_cluster_centers$x))),
size = 5,
fontface = "bold",
col = "yellow2") +
scale_x_continuous(expand = c(0,0),
limits = c(0,1)) +
scale_y_continuous(expand = c(0,0),
limits = c(0,1)) +
theme_tufte(base_size = 11, base_family = "Helvetica") +
theme(axis.text = element_text(color = "black"),
panel.border = element_rect(colour = "black", fill=NA, size=0.7),
legend.key.height = unit(0.8,"line"),
legend.key.width = unit(0.5, "line")) +
scale_fill_gradientn(name = "Density",
colours = matlab.like(1000)) +
scale_color_brewer(name = "cluster ID",
type = "qual",
palette = "Spectral")

Instead of k-means, I suggest that you use DBSCAN, a density-based clustering algorithm.
It is a well-tested and widely used algorithm for finding density-connected components of arbitrary shape.
The N in the name stands for noise: the algorithm can "ignore" points that do not belong to any cluster because they lie in low-density regions. This makes it fairly robust to noise, which may help you.
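A minimal sketch of what this could look like with the dbscan package (assumed to be installed). The data are simulated stand-ins, since the original data set is not reproduced here, and the `eps` and `minPts` values are guesses that would need tuning on the real data.

```r
library(dbscan)  # assumed installed: install.packages("dbscan")

# Stand-in data: three tight blobs plus uniform background noise on [0, 1]^2
set.seed(1)
centers <- rbind(c(0.2, 0.2), c(0.5, 0.7), c(0.8, 0.3))
blobs <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(x = rnorm(100, centers[i, 1], 0.02),
        y = rnorm(100, centers[i, 2], 0.02))
}))
noise <- cbind(x = runif(100), y = runif(100))
pts <- rbind(blobs, noise)

# eps: neighbourhood radius; minPts: points needed to form a dense region
db <- dbscan(pts, eps = 0.03, minPts = 10)

# Cluster 0 is the noise label; positive integers are clusters
table(db$cluster)
```

Unlike tclust, the number of clusters is not specified up front; it falls out of the choice of `eps` and `minPts`.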

If you are looking for peaks in the density, the mean shift algorithm may be helpful. As with any clustering algorithm, you may want to spend some time tuning the parameters, but I got something that seems plausible pretty quickly.
library(LPCM)
MS7 = ms(df, 0.07)
MS7$cluster.center
[,1] [,2]
1 0.55790817 0.46878846
2 0.42916901 0.60982702
3 0.04142821 0.63190748
4 0.58098385 0.03693459
5 0.01561478 0.19987934
6 0.18271326 0.01630580
7 0.80381893 0.65499869
8 0.59797721 0.88041362
9 0.86784436 0.95078057
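To show the idea behind mean shift, here is a hand-rolled sketch in base R. It is not LPCM's implementation: `mean_shift_point` is a hypothetical helper that assumes a Gaussian kernel with bandwidth `h`, and the data are simulated stand-ins for the real points.

```r
# Move a starting point uphill on the kernel density estimate by repeatedly
# replacing it with the kernel-weighted mean of the data around it.
mean_shift_point <- function(start, data, h, iters = 200, tol = 1e-8) {
  x <- start
  for (i in seq_len(iters)) {
    d2 <- rowSums(sweep(data, 2, x)^2)  # squared distance of every point to x
    w <- exp(-d2 / (2 * h^2))           # Gaussian kernel weights
    x_new <- colSums(data * w) / sum(w) # weighted mean of the data
    shift <- sqrt(sum((x_new - x)^2))
    x <- x_new
    if (shift < tol) break              # converged: x is (close to) a mode
  }
  x
}

# Stand-in data: two tight blobs
set.seed(42)
pts <- rbind(
  cbind(x = rnorm(200, 0.2, 0.03), y = rnorm(200, 0.2, 0.03)),
  cbind(x = rnorm(200, 0.8, 0.03), y = rnorm(200, 0.8, 0.03))
)

# A start near the first blob should climb to that blob's density peak
mode1 <- mean_shift_point(c(0.3, 0.3), pts, h = 0.07)
mode1
```

Running this from every data point and grouping the points that end up at the same mode gives the clustering; the bandwidth plays the same role as the 0.07 passed to ms() above.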
