Grouping data outside limits in histogram using ggplot2 - r

I am trying to draw a histogram zoomed in on part of the data. My problem is that I would like to group everything outside the range into a last category, "10+". Is it possible to do this using ggplot2?
Sample code:
library(ggplot2)
library(scales)  # provides percent() for the y-axis labels
x <- data.frame(runif(10000, 0, 15))
ggplot(x, aes(runif.10000..0..15.)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), colour = "grey50", binwidth = 1) +
scale_y_continuous(labels = percent) +
coord_cartesian(xlim=c(0, 10)) +
scale_x_continuous(breaks = 0:10)
Here is how the histogram looks now:
[image: how the histogram looks now]
And here is how I would like it to look:
[image: how the histogram should look]
It is probably possible to do this by nesting ifelse() calls, but since my real problem has more cases, is there a way for ggplot to do it?

You could use forcats and dplyr to efficiently categorize the values, aggregate the last "levels" and then compute the percentages before the plot. Something like this should work:
library(forcats)
library(dplyr)
library(ggplot2)
x <- data.frame(x = runif(10000, 0, 15))
x2 <- x %>%
  mutate(x_grp = cut(x, breaks = c(seq(0, 15, 1)))) %>%
  # levels 11 to 15 are the bins (10,11] ... (14,15]; collapse them into "10+"
  mutate(x_grp = fct_collapse(x_grp, "10+" = levels(x_grp)[11:15])) %>%
  group_by(x_grp) %>%
  dplyr::summarize(count = n())
ggplot(x2, aes(x = x_grp, y = count/10000)) +
  geom_bar(stat = "identity", colour = "grey50") +
  scale_y_continuous(labels = scales::percent)  # percent() comes from the scales package
The resulting graph looks very different from your example, but I think it is correct: the data are drawn from a uniform distribution, so the collapsed category, which spans several bins, is much taller than any single bin.
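An alternative that stays closer to the original geom_histogram call is to cap the values before plotting, so that everything above the limit falls into one extra bin. The following is a minimal sketch under that assumption, not part of the answer above; the column names value and clamped and the cap at 10.5 are illustrative choices.

```r
library(ggplot2)

set.seed(1)
x <- data.frame(value = runif(10000, 0, 15))

# Cap values at 10.5 so that everything above 10 lands in the extra bin (10, 11]
x$clamped <- pmin(x$value, 10.5)

p <- ggplot(x, aes(clamped)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 breaks = 0:11, colour = "grey50") +
  # One tick per bin centre; the last bin is labelled "10+"
  scale_x_continuous(breaks = 0:10 + 0.5,
                     labels = c(paste0(0:9, "-", 1:10), "10+")) +
  scale_y_continuous(labels = scales::percent)

# Share of observations per bin, as computed by the histogram layer
bin_share <- ggplot_build(p)$data[[1]]$y
```

Printing p draws the plot. With more cut-offs, only the breaks vector and the labels need to change, which avoids nested ifelse() calls.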

Related

Is there a way to add legend and count to each level for geom_point?

Is there a way to add a legend with the count to give density of each row?
Or an easier way to show it?
Thanks very much!
Couldn't even get a legend added :)
Code I used:
data %>%
ggplot(aes(x = subscribed, y = campaign)) +
geom_point () +
geom_jitter()
You could compute, per group (subscribed), a label that contains the number of observations from n() and store it as a character column. Mapping that column to an aesthetic such as colour makes the counts appear in the legend. Here is a reproducible example:
library(dplyr)
library(ggplot2)
df %>%
  group_by(subscribed) %>%
  mutate(count = paste0(subscribed, ' (n = ', n(), ')')) %>%
  ggplot(aes(subscribed, campaign, colour = factor(count))) +
  geom_jitter()
Created on 2023-01-12 with reprex v2.0.2
Created data:
df <- data.frame(campaign = runif(100),
subscribed = rep(c("no", "yes"), 50))
I found another way to show similar data to this, in a more clear manner.
However, I couldn't figure out the legend lol
The code I used was :
p <- ggplot(data = data, aes(x = subscribed, y = pdays)) +
geom_count() + scale_size_continuous(range = c(7, 30))
p + geom_text(data = ggplot_build(p)$data[[1]],
aes(x, y, label = n), color = "#ffffff") +
scale_y_continuous(breaks = seq(0, 30, by = 4))

How do I add data labels to a ggplot histogram with a log(x) axis?

I am wondering how to add data labels to a ggplot showing the true value of the data points when the x-axis is in log scale.
I have this data:
date <- c("4/3/2021", "4/7/2021","4/10/2021","4/12/2021","4/13/2021","4/13/2021")
amount <- c(105.00, 96.32, 89.00, 80.84, 121.82, 159.38)
address <- c("A","B","C","D","E","F")
df <- data.frame(date, amount, address)
And I plot it in ggplot2:
plot <- ggplot(df, aes(x = log(amount))) +
geom_histogram(binwidth = 1)
plot + theme_minimal() + geom_text(label = amount)
... but I get the error
"Error: geom_text requires the following missing aesthetics: y"
I have 2 questions as a result:
Why am I getting this error with geom_histogram? Shouldn't it assume to use count as the y value?
Will this successfully show the true values of the data points from the 'amount' column despite the plot's log scale x-axis?
Perhaps like this?
ggplot(df, aes(x = log(amount), y = ..count.., label = ..count..)) +
geom_histogram(binwidth = 1) +
stat_bin(geom = "text", binwidth = 1, vjust = -0.5) +
theme_minimal()
ggplot2 layers do not (at least in any situations I can think of) take the summary calculations of other layers, so I think the simplest thing would be to replicate the calculation using stat_bin(geom = "text"...
Or perhaps simpler, you could pre-calculate the numbers:
library(dplyr)
df %>%
count(log_amt = round(log(amount))) %>%
ggplot(aes(log_amt, n, label = n)) +
geom_col(width = 1) +
geom_text(vjust = -0.5)
EDIT -- to label the buckets in the original units instead of log units, we could use:
df %>%
  count(log_amt = round(log(amount))) %>%
  ggplot(aes(log_amt, n, label = n)) +
  geom_col(width = 0.5) +
  geom_text(vjust = -0.5) +
  # log() is the natural log, so back-transform with exp() rather than 10^
  scale_x_continuous(labels = ~scales::comma(exp(.)),
                     minor_breaks = NULL)
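Since log() in R is the natural logarithm, bucket centres on the log scale map back to the original units through exp(). A small base-R sketch with the amounts from the question (the names log_bucket, bucket_counts and bucket_centres are illustrative, not from the answer):

```r
amount <- c(105.00, 96.32, 89.00, 80.84, 121.82, 159.38)

# Same bucketing as round(log(amount)) in the pre-calculated approach
log_bucket <- round(log(amount))
bucket_counts <- table(log_bucket)

# Bucket centres expressed in the original units
bucket_centres <- exp(sort(unique(log_bucket)))

print(bucket_counts)
print(round(bucket_centres, 1))
```

If the plot used log10(amount) instead, the matching back-transform would be 10^x.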

How to remove extra column in facet_wrap plot with ggplot2?

I am trying to generate a faceted plot with facet_wrap from unbalanced grouped data, and the result has extra blank columns along the x-axis.
As shown in the plot, I want to generate a plot without the rightmost empty axis column.
Here is an example code:
library(ggplot2)
library(dplyr)  # provides the %>% pipe used below
name <- c(factor(letters[1:4]),factor(LETTERS[1:3]))
room <- rep(c('A','B'),c(4,3))
goal <- c(rnorm(7,mean=60,sd=10))
test <- data.frame(name,goal,room)
test %>% ggplot(aes(name, goal))+
facet_wrap(~factor(room))+
geom_bar(stat = "identity")
With scales = "free" the axis limits are set automatically; can they be set manually?
facetted_pos_scales in ggh4x, developed by @teunbrand, solved the problem, thanks! Here is the supplementary code:
library(ggh4x)
scales <- list(
scale_y_continuous(limits = c(0, 100)),
scale_y_continuous(limits = c(0, 80))
)
test %>% ggplot(aes(name, goal))+
facet_wrap(~factor(room), scales="free")+
geom_bar(stat = "identity")+
facetted_pos_scales(y=scales)
Update in response to the OP's comment:
Does this help? You can use coord_cartesian(ylim = c(0, 100)) to set the y limits:
test %>% ggplot(aes(name, goal))+
geom_bar(stat = "identity")+
coord_cartesian(ylim = c(0, 100)) +
facet_wrap(~factor(room), scales="free")
Use scales="free" instead of scales="free_x"
library(ggplot2)
library(dplyr)  # provides the %>% pipe used below
name <- c(factor(letters[1:4]),factor(LETTERS[1:3]))
room <- rep(c('A','B'),c(4,3))
goal <- c(rnorm(7,mean=60,sd=10))
test <- data.frame(name,goal,room)
test %>% ggplot(aes(name, goal))+
facet_wrap(~factor(room), scales="free")+
geom_bar(stat = "identity")

Normal curves on multiple histograms on a same plot

My example dataframe:
sample1 <- seq(100,157, length.out = 50)
sample2 <- seq(113, 167, length.out = 50)
sample3 <- seq(95,160, length.out = 50)
sample4 <-seq(88, 110, length.out = 50)
df <- as.data.frame(cbind(sample1, sample2, sample3, sample4))
I have managed to create histograms for these four variables, which share the same y-axis. Now I need an overlay normal curve. Based on previous posts, I've managed a density curve, but this is not what I want. This comes close, but I'd like a smooth line...
This is my current code for plotting:
df <- as.data.table(df)
new.df<-melt(df,id.vars="sample")
names(new.df)=c("sample","type","value")
cdat <- ddply(new.df, "type", summarise, value.mean=mean(value))
ggplot(data = new.df,aes(x=value)) +
geom_histogram(aes(x = value), bins = 15, colour = "black", fill = "gray") +
facet_wrap(~ type) + geom_density(aes(x = value),alpha=.2, fill="#FF6666") +
geom_vline(data=cdat, aes(xintercept=value.mean),
linetype="dashed", size=1, colour="black") +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),aspect.ratio = 0.75 )
And I found the following code, which I hoped would do the trick, but this gives me nothing:
stat_function(fun = dnorm, args = list(mean = mean(df$value), sd = sd(df$value)))
Unfortunately, stat_function doesn't play nicely with facets: it overlays the same function on each facet without taking account of the faceting variable.
One of the most common reasons I see for people posting ggplot questions on Stack Overflow is that they get lost while trying to coerce ggplot to do too much of their data manipulation. Functions like geom_smooth and geom_function are useful helpers for common tasks, but if you want to do something that is complex or uncommon, it is best to produce the data you want to plot, then plot it.
In fact, the main author of ggplot2 recommends this approach for a very similar problem to yours in this thread, saying:
I think you are better off generating the data outside of ggplot2 and then plotting it. See https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse to get started.
Hadley Wickham, 26 April 2018
So here's one way of doing that using tidyverse. You create a data frame of the dnorm for each sample and plot these using plain old geom_line.
Note that your histograms are counts, so you either need to change them to density, or multiply the dnorm output by the number of observations * the binwidth, otherwise you will just get an apparently "flat" line on the x axis, since the dnorm values will all be so small in relation to the counts:
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
dfn <- df %>%
  pivot_longer(everything()) %>%
  ddply("name", function(x) {
    xvar <- seq(min(x$value), max(x$value), length.out = 100)
    # 5 is the histogram binwidth: count scale = density * n * binwidth
    data.frame(value = xvar,
               y = 5 * nrow(x) * dnorm(xvar, mean(x$value), sd(x$value)))
  })
df %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  mutate(mean = mean(value), sd = sd(value)) %>%
  ggplot(aes(value)) +
  geom_histogram(aes(x = value), binwidth = 5,
                 colour = "black", fill = "gray") +
  facet_wrap(~ name) +
  geom_vline(aes(xintercept = mean),
             linetype = "dashed", size = 1, colour = "black") +
  geom_line(data = dfn, aes(y = y)) +
  theme_classic() +
  theme(text = element_text(size = 15), element_line(size = 0.5),
        aspect.ratio = 0.75)
Created on 2020-12-07 by the reprex package (v0.3.0)
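The scaling factor can be checked outside ggplot2. The sketch below uses simulated data (an assumption, not the samples from the question) and base R's hist() to show that density times the number of observations times the binwidth reproduces the bin counts, which is why the dnorm curve needs the same factor to sit on the count scale.

```r
set.seed(42)
n <- 1000
binwidth <- 5
value <- rnorm(n, mean = 130, sd = 15)

# Bin edges at multiples of the bin width that cover the data
breaks <- seq(floor(min(value) / binwidth) * binwidth,
              ceiling(max(value) / binwidth) * binwidth,
              by = binwidth)
h <- hist(value, breaks = breaks, plot = FALSE)

# Histogram densities rescaled to the count scale
rescaled <- h$density * n * binwidth

# Normal curve on the count scale, evaluated at the bin midpoints
curve_counts <- n * binwidth * dnorm(h$mids, mean(value), sd(value))
```

Here rescaled equals h$counts, and curve_counts sums to roughly n, so the curve and the bars share one axis.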

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun.data = "mean_cl_boot", size = 1.2) +
  annotate("text",
           x = 1:2 + xoff,
           y = df_text$avg + yoff,
           label = df_text$n) +
  coord_flip() +
  theme_classic()
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
# Store the base plot in p so the text layer can be added to it afterwards
p <- ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun.data = "mean_cl_boot", size = 1.2) +
  coord_flip() +
  theme_classic()
p + stat_summary(geom = "text", fun.data = counts_df)
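The fixed y = 1 in counts_df can also be replaced by a position derived from the data. A minimal sketch, assuming it is acceptable to put the label at each group's maximum; count_label is a hypothetical helper, and fun = mean with a point geom stands in for mean_cl_boot so that the example does not need the Hmisc package:

```r
library(ggplot2)

set.seed(123)
df_x <- data.frame(Group = c(rep("A", 1000), rep("B", 2)),
                   Value = rnorm(1002))

# Hypothetical helper: one row per group, label placed at the group's maximum
count_label <- function(y) {
  data.frame(y = max(y), label = paste0("n=", length(y)))
}

p <- ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  # hjust = 1 right-aligns the text so it ends at the group's maximum
  stat_summary(fun.data = count_label, geom = "text", hjust = 1) +
  coord_flip() +
  theme_classic()

# One label per group, taken from the built text layer
labels <- ggplot_build(p)$data[[2]]$label
```

Because the label is computed inside stat_summary, it follows any filtering, reordering or faceting applied to the data.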
