Changing binwidth of density histogram so that probabilities sum up to 1


I have already found numerous questions about this, but none of them really helped me. I do not understand how to change the binwidth in a density histogram in ggplot2 so that the probabilities sum up to 1. It seems to work only if the binwidth is exactly 1.
Here is an example:
library(ggplot2)
library(gridExtra) # provides grid.arrange()
set.seed(1)
df = data.frame("data" = runif(1000, min=0, max=100))
a = ggplot(data = df, aes(x = data))+
geom_histogram(aes(y=..density..),colour="black", fill = "white",
breaks=seq(0, 100, by = 50))
b = ggplot(data = df, aes(x = data))+
geom_histogram(aes(y =..density..),
breaks=seq(0, 100, by = 30),
col="black",
fill="white")
c = ggplot(data = df, aes(x = data))+
geom_histogram(aes(y =..density..),
breaks=seq(0, 100, by = 10),
col="black",
fill="white")
d = ggplot(data = df, aes(x = data))+
geom_histogram(aes(y =..density..),
breaks=seq(0, 100, by = 1),
col="black",
fill="white")
grid.arrange(a,b,c,d, ncol= 2)
If you look at the probability axis, you can see that the first three graphs must be wrong. These are not the right histograms, as the bin heights do not sum up to 1. The y-axis does not even change significantly between histograms a, b, c and d. I also tried to replace the "breaks" argument with the "binwidth" argument, but the result is even worse.
I would also like to know how to compute the probabilities of the individual bins of a histogram, to prove whether they sum up to 1 or not.
Thanks for any help.

Simulate some data:
library(ggplot2)
library(dplyr)
set.seed(1)
df = data.frame("data" = runif(1000, min=0, max=100))
The first plot you can get is:
# y axis has the density estimate values
ggplot(data = df, aes(x = data))+
geom_histogram(aes(y=..density..),colour="black", fill = "white",
breaks=seq(0, 100, by = 50))
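This plot has density estimates on the y axis. A density is a bin's count divided by (total count * bin width), so it is the area of each bar (height * width), not its height, that represents a probability. The areas sum to 1 for any bin width, while the heights only sum to 1 when the bin width is exactly 1. The density values are on the same scale as a density plot, which you can see in this version where the density plot is overlaid: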
# y axis has the density estimate values and the density plot
ggplot(data = df, aes(x = data))+
geom_histogram(aes(y=..density..),colour="black", fill = "white",
breaks=seq(0, 100, by = 50)) +
geom_density(aes(data), col="red")
A way to interpret this is that the heights of the red line are densities, not probabilities. The area under the line is 1, so the wider the range the data is spread over, the closer to zero the values on the y axis have to be.
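As a quick check of the point above, here is a minimal sketch that uses base R's hist() instead of ggplot2 (my own choice, only to keep the example dependency-free). It multiplies each bar's density by its width to get a per-bin probability, and shows that these sum to 1 for several bin widths.

```r
set.seed(1)
x <- runif(1000, min = 0, max = 100)

# For each bin width, convert densities to per-bin probabilities
# (density * bin width) and add them up.
for (by in c(50, 10, 1)) {
  h <- hist(x, breaks = seq(0, 100, by = by), plot = FALSE)
  bin_prob <- h$density * diff(h$breaks)
  cat("bin width", by,
      "| largest density:", max(h$density),
      "| sum of bin probabilities:", sum(bin_prob), "\n")
}
```

The largest density stays near 0.01 for every bin width, which is why the y axis barely changes between the four plots in the question.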
You can get what you want with this:
# y axis has the probabilities of each bar (bar counts / all counts)
ggplot(data = df, aes(x = data))+
geom_histogram(aes(y=..count../sum(..count..)),colour="black", fill = "white",
breaks=seq(0, 100, by = 50))
Another way to do the above, while keeping the data (for future use, or just to check that the probabilities sum to 1), is this:
# assign the breaks
breaks = cut(df$data, seq(0, 100, by = 50))
# count observations in each bar and probability of each bar
df %>%
mutate(Breaks = breaks) %>%
count(Breaks) %>%
mutate(Prc = n/sum(n))
# # A tibble: 2 x 3
# Breaks n Prc
# <fctr> <int> <dbl>
# 1 (0,50] 520 0.52
# 2 (50,100] 480 0.48
# plot the above
df %>%
mutate(Breaks = breaks) %>%
count(Breaks) %>%
mutate(Prc = n/sum(n)) %>%
ggplot(aes(Breaks, Prc)) + geom_col()
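To check the bin probabilities directly from a ggplot2 histogram, something like the following sketch may help. It assumes ggplot2 3.3.0 or later (for after_stat()) and relies on the `width` variable that stat_bin computes alongside `density`. layer_data() returns the values ggplot2 actually draws.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(data = runif(1000, min = 0, max = 100))

# density * width is the probability mass of each bin
p <- ggplot(df, aes(x = data)) +
  geom_histogram(aes(y = after_stat(density * width)),
                 breaks = seq(0, 100, by = 10),
                 colour = "black", fill = "white") +
  ylab("probability")

# Inspect the computed bins: y holds the per-bin probabilities
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count", "density", "y")]
sum(bins$y)
```

Here `sum(bins$y)` should come out as 1, and the same holds for `sum(bins$density * (bins$xmax - bins$xmin))`.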

Related

How to keep default axis labels but add an additional label in ggplot2

I would like to keep the default labels ggplot2 provides for Y-axis below, but always have a Y-axis tick and/or label at y = 100 to highlight the horizontal line intercept.
library(ggplot2)
maxValue <- 1000
df <- data.frame(
var1 = seq(1, maxValue, by = 25),
var2 = seq(1, maxValue, by = 50)
)
ggplot(df, aes(x = var1, y = var2)) +
geom_point() +
geom_hline(yintercept = 100, color = "red")
Created on 2022-04-09 by the reprex package (v2.0.1.9000)
Expected output:
Note that maxValue can be anything. So the solution to just increase in steps of 100 doesn't work. For example:
plot <- plot +
scale_y_continuous(
breaks = seq(0, max(df$y) + 100, 100),
labels = as.character(seq(0, max(df$y) + 100, 100))
)
This is because if the max value is 10000 or a similar big number like that, the number of labels will be overwhelming. This is why I would like to stay with the default Y-axis labels that ggplot2 provides and only add a single additional label at y = 100.

By default ggplot2 will compute the default axis breaks in the following manner (Refer to this answer for more details):
labeling::extended(min(df$var2), max(df$var2), m = 5)
We can just add your custom value 100 to this vector and pass it to scale_y_continuous. Since these are breaks for the y axis, they are computed from var2:
def_breaks <- labeling::extended(min(df$var2), max(df$var2), m = 5)
ggplot(df, aes(x = var1, y = var2)) +
geom_point() +
geom_hline(yintercept = 100, color = "red") +
scale_y_continuous(breaks = c(100, def_breaks),
# pass to minor breaks so that they are not messed up
minor_breaks = def_breaks)
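A variation on the same idea, sketched under the assumption that the scales package is available (it is installed with ggplot2): pass a function as `breaks`, so the default breaks are computed from whatever limits the scale ends up with. The helper name `add_break` is made up for this example.

```r
library(ggplot2)

# Hypothetical helper: returns a breaks function that computes the usual
# extended breaks for the given limits and adds one extra value.
add_break <- function(extra, n = 5) {
  function(limits) {
    sort(unique(c(extra, scales::extended_breaks(n = n)(limits))))
  }
}

maxValue <- 1000
df <- data.frame(
  var1 = seq(1, maxValue, by = 25),
  var2 = seq(1, maxValue, by = 50)
)

p <- ggplot(df, aes(x = var1, y = var2)) +
  geom_point() +
  geom_hline(yintercept = 100, color = "red") +
  scale_y_continuous(
    breaks = add_break(100),
    # keep the minor grid based on the default breaks only
    minor_breaks = function(limits) scales::extended_breaks(n = 5)(limits)
  )
p
```

Because the breaks are computed from the limits, nothing needs to change when maxValue does.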

ggplot: transparency of histogram as function of stat(count)

I'm trying to make a scaled histogram in such a way that the transparency of each "column" (bin?) depends on the number of observations in a given range of x. Here is my code:
set.seed(1)
test = data.frame(x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=100)))
threshold = 20
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = stat(count) > threshold),
position = "fill", bins = 10)
Basically I want to make plots that will looks like this:
However, my code generates plots where the transparency is applied based on the count after grouping, which ends up with hanging columns like this:
For this example, in order to simulate a "proper" plot I just adjusted the threshold, but I need alpha to consider the sum of counts from both groups in a given "column" (bin).
UPDATE:
I also want it to work with faceted plots in such a way that the highlighted area in each facet is independent of the other facets. The approach proposed by @Stefan works perfectly for an individual plot, but in a faceted plot it highlights the same area in all facets.
library(ggplot2)
set.seed(1)
test = data.frame(x = rnorm(1000, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=1000)),
n = as.factor(sample(c(0,1,2), replace=TRUE, size=1000)),
m = as.factor(sample(c(0,1,3,4), replace=TRUE, size=1000)))
f = function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
threshold = 10
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10)+
facet_grid(rows = vars(n),
cols = vars(m))

This could be achieved like so:
As the count computed by stat_bin (the stat behind geom_histogram) is the number of observations after grouping, we have to manually aggregate the count over groups to get the total count per bin.
To aggregate the counts per bin I use tapply, where I make use of the .. notation to get the variables computed by stat_bin.
As the grouping variable I make use of the computed variable ..x.., which to the best of my knowledge is not documented. Basically ..x.. contains by default the midpoints of the bins and as such can be used as an identifier for the bins. However, as these are continuous values we have to convert them to a factor.
Finally, to make the code more readable I use an auxiliary function to compute the aggregate counts. Additionally I double the threshold value to 20.
library(ggplot2)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100))
)
threshold <- 20
f <- function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10
)
p
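To see what the helper does in isolation, here is a small sketch with made-up numbers (not values from the plot). Each bin midpoint appears once per fill group, and the function returns the bin total repeated for every row of that bin. The name `total_per_bin` is only for illustration.

```r
# Made-up per-group counts and bin midpoints: three bins, two fill groups
count <- c(3, 5, 2, 7, 4, 1)
x     <- c(-5, 5, 15, -5, 5, 15)

# Sum the counts within each bin, then map the totals back to the rows
total_per_bin <- function(count, x) {
  as.vector(tapply(count, factor(x), sum)[factor(x)])
}

total_per_bin(count, x)
#> [1] 10  9  3 10  9  3

# The comparison used for alpha is then made on the bin totals
total_per_bin(count, x) > 8
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```

Both rows of a bin get the same TRUE/FALSE value, so the whole column is either highlighted or not.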
EDIT To allow for faceting we have to pass the function the ..PANEL.. identifier as an additional argument. Instead of using tapply I now use dplyr::add_count to compute the total count per bin and facet panel:
library(ggplot2)
library(dplyr)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100)),
type = rep(c("A", "B"), each = 100)
)
threshold <- 20
f <- function(count, x, PANEL) {
data.frame(count, x, PANEL) %>%
add_count(x, PANEL, wt = count) %>%
pull(n)
}
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x.., ..PANEL..) > threshold),
position = "fill", bins = 10
) +
facet_wrap(~type)
p
#> Warning: Using alpha for a discrete variable is not advised.
#> Warning: Removed 2 rows containing missing values (geom_bar).

Dodging vertical lines for median_hilow in ggplot

I need to plot lines that show median and IQR for 3 replicates, across multiple samples.
Data:
sampleid <- rep(1:20, each = 3)
replicate <- rep(1:3, 20)
sample1 <- seq(120,197, length.out = 60)
sample2 <- seq(113, 167, length.out = 60)
sample3 <- seq(90,180, length.out = 60)
What I have done so far?
df <- as.data.frame(cbind(sampleid,replicate,sample1, sample2, sample3))
library(reshape2)
long <- melt(df,id.vars = c('sampleid', 'replicate'))
ggplot(data = long, aes(x = variable, y = value, colour = factor(replicate))) + stat_summary(fun.data=median_hilow, conf.int=.5)
However, the IQR lines for the replicates overlap each other for each sample. I would like to find a way to "dodge" these 3 lines so that they are visible next to each other, without changing the other parameters of the plot. Is this achievable?

You have to introduce jitter to the lines:
ggplot(data = long, aes(x = variable, y = value, colour = factor(replicate))) +
stat_summary(fun.data=median_hilow, fun.args = list(conf.int=.5), position = "jitter")
Please note you also need to have your conf.int=.5 wrapped in a list passed to fun.args.
Alternatively, change your x to factor(replicate) and add facet_wrap:
ggplot(data = long, aes(x = factor(replicate), y = value, colour = factor(replicate))) +
stat_summary(fun.data=median_hilow, fun.args = list(conf.int=.5)) +
facet_wrap(~variable)
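If a fixed offset is preferred over random jitter, position_dodge() may be closer to what the question asks for. This is a sketch rather than part of the answer above: the dodge width of 0.5 is an arbitrary choice, the long data frame is rebuilt in base R to avoid the reshape2 dependency, and rendering the plot assumes the Hmisc package is installed, since median_hilow wraps a function from it.

```r
library(ggplot2)

# Same data as in the question, built directly in long format
long <- data.frame(
  variable  = rep(c("sample1", "sample2", "sample3"), each = 60),
  replicate = rep(rep(1:3, 20), times = 3),
  value     = c(seq(120, 197, length.out = 60),
                seq(113, 167, length.out = 60),
                seq(90, 180, length.out = 60))
)

# position_dodge() places the three replicate ranges side by side
p <- ggplot(long, aes(x = variable, y = value, colour = factor(replicate))) +
  stat_summary(fun.data = median_hilow,
               fun.args = list(conf.int = 0.5),
               position = position_dodge(width = 0.5))
p
```

Unlike "jitter", the dodge offsets are deterministic, so the plot looks the same on every run.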

Advice on how to plot side by side histograms with line graph going through in ggplot2

I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like one graph, but I have had issues getting the endpoints to match up, since each facet is a separate panel. Can anyone give me some advice on how to do this? I am not particularly attached to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work just to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))

You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.

Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))

You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun = "median") # `fun` replaces `fun.y`, deprecated since ggplot2 3.3.0
This produces a violin for each value of rho, and joins the medians for each violin.

Residual plot with ggplot with X-axis as "ranked" residuals

I'm trying to re-create a plot like this in ggplot.
This graph takes the residuals from a regression output, and plots them in order (with the X-axis being a rank of residuals).
My best attempt at this was something like the following:
library(ggplot2)
library(modelr)
d <- d %>% add_residuals(mod1, var = "resid")
d$resid_rank <- rank(d$resid)
ggplot(data = d, aes(x = resid_rank, y = resid)) +
geom_bar(stat="identity") +
theme_bw()
However, this yields a completely blank graph. I also tried something like this:
ggplot(data = d, aes(x = resid_rank, y = resid)) +
geom_segment(yend = 0, aes(xend=resid)) +
theme_bw()
But this yields segments that go in the wrong direction. What is the right way to do this, and to color those lines by a third factor?
FAKE DATASET:
library(estimatr)
library(fabricatr)
#simulation
dat <- fabricate(
N = 10000,
y = runif(N, 0, 10),
x = runif(N, 0, 100)
)
#add an outlier
dat <- rbind(dat, c(300, 5))
dat <- rbind(dat, c(500, 3))
dat$y_log <- log(dat$y)
dat$x_log <- log(dat$x)
dat$y_log_s <- scale(log(dat$y))
dat$x_log_s <- scale(log(dat$x))
mod1 <- lm(y_log ~ x_log, data = dat)

I used the built-in dataset from the help page on lm() to create this example. I also just directly used resid() to get the residuals. It's unclear where / why the colored bars would be different, but basically you'd need to add a column to your data.frame that specifies why they are red or blue, then pass that to fill.
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.4.4
#example from lm
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
resids <- data.frame(resid = resid(lm.D9))
#why are some bars red and some blue? No clue - so I'll pick randomly
resids$group <- sample(c("group 1", "group 2"), nrow(resids), replace = TRUE)
#rank
resids$rank <- rank(-1 * resids$resid)
ggplot(resids, aes(rank, resid, fill = group)) +
geom_bar(stat = "identity", width = 1) +
geom_hline(yintercept = c(-1,1), colour = "darkgray", linetype = 2) +
geom_hline(yintercept = c(-2,2), colour = "lightgray", linetype = 1) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = c("group 1" = "red", "group 2" = "blue"))
Created on 2019-01-24 by the reprex package (v0.2.1)
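On the geom_segment attempt from the question: the segments go the wrong way because xend is mapped to resid rather than to the rank. A minimal sketch of the vertical-segment version follows. The mtcars model and the cyl colouring are stand-ins I picked for illustration, not the asker's data.

```r
library(ggplot2)

# Stand-in model on a built-in dataset
mod <- lm(mpg ~ wt, data = mtcars)

d <- data.frame(
  resid = resid(mod),
  cyl   = factor(mtcars$cyl)  # third factor used for colour
)
# ties.method = "first" gives every residual its own x position
d$resid_rank <- rank(d$resid, ties.method = "first")

# Each segment runs vertically from (rank, resid) down or up to (rank, 0)
p <- ggplot(d, aes(x = resid_rank, y = resid, colour = cyl)) +
  geom_segment(aes(xend = resid_rank, yend = 0)) +
  theme_bw()
p
```

With many observations, thin segments like these may stay visible where bars become too narrow to draw.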
