How to remove low frequency bins in histogram

Say I have a data frame with a numeric column that I want to visualise as a histogram. I want to show only the bins that contain more than, say, 50 observations.
Step 1
library(dplyr)    # provides %>%
library(ggplot2)
set.seed(10)
x <- data.frame(x = rnorm(1000, 50, 2))
p <- x %>%
  ggplot(aes(x)) +
  geom_histogram()
p
Step 2
pg <- ggplot_build(p)
pg$data[[1]]
As a check, when I print pg$data[[1]] I would like to see only rows where count >= 50.
Thank you

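library(ggplot2)
# Bins with 50 or fewer observations are drawn with height 0.
# after_stat(count) is the current spelling of the deprecated ..count..
ggplot(x, aes(x = x, y = ifelse(after_stat(count) > 50, after_stat(count), 0))) +
  geom_histogram(bins = 30)
With this code you can also see the counts of the removed bins:
ggplot(x, aes(x = x, y = ifelse(after_stat(count) > 50, after_stat(count), 0))) +
  geom_histogram(bins = 30, fill = "green", color = "grey") +
  stat_bin(aes(label = after_stat(count)), geom = "text", vjust = -0.7)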
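For the check described in Step 2 of the question, one possible sketch is to pull the computed bins out of the plot, keep only the rows that pass the threshold, and redraw them as rectangles. It assumes ggplot2 is installed; the names `bins` and `kept` are made up for this illustration, and the threshold uses `>=` as in Step 2.

```r
library(ggplot2)

set.seed(10)
x <- data.frame(x = rnorm(1000, 50, 2))
p <- ggplot(x, aes(x)) + geom_histogram(bins = 30)

# layer_data() returns the data frame that stat_bin computed for layer 1
bins <- layer_data(p, 1)

# Keep only bins with at least 50 observations
kept <- subset(bins, count >= 50, select = c(xmin, xmax, count))
print(kept)

# Redraw the surviving bins with their original edges
p_kept <- ggplot(kept, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = count)) +
  geom_rect(colour = "grey")
p_kept
```

Unlike the `ifelse()` approach, the low-frequency bins are dropped from the data entirely rather than drawn with zero height.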

You could do something like this. You probably won't like the factor labels on the x-axis, but you can split each label into its two endpoints and use their average as the bin's position on the x-axis.
library(dplyr)
library(ggplot2)
x %>%
  mutate(bin = cut(x, breaks = 30)) %>%  # assign each value to one of 30 bins
  group_by(bin) %>%
  mutate(count = n()) %>%                # observations per bin
  filter(count > 50) %>%                 # keep well-populated bins only
  ggplot(aes(bin)) +
  geom_bar()                             # counts the remaining rows per bin
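The idea of splitting each bin label and averaging its two endpoints could look roughly like the base R sketch below. The variable names (`bin`, `edges`, `mid`, `kept`) are illustrative, and because `cut()` rounds the numbers in its labels, the midpoints are approximate.

```r
set.seed(10)
x <- rnorm(1000, 50, 2)

# Labels produced by cut() look like "(48.1,48.5]"
bin <- cut(x, breaks = 30)

# Pull the two numbers out of every label and average them
edges <- regmatches(levels(bin), gregexpr("-?[0-9.]+", levels(bin)))
mid <- vapply(edges, function(e) mean(as.numeric(e)), numeric(1))

# Count observations per bin and keep the bins with more than 50
counts <- as.vector(table(bin))
kept <- data.frame(mid = mid, count = counts)[counts > 50, ]
print(kept)
```

Passing `kept` to something like `ggplot(kept, aes(mid, count)) + geom_col()` would then give a numeric x-axis instead of factor labels.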

Related

Plots for two variables within a group above each other with ggplot2

I want to use ggplot2 to plot two variables for multiple individuals (4 in the example below). For every individual, I want the graphs for the two variables to be placed one above the other.
Example data:
da = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), day = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4), var1= c(3,4,2,1,2,2,2,3,4,4,5,3,2,1,2,3), var2 = c(1,1,1,2,2,2,1,2,2,1,2,1,1,1,1,2))
I can do the plots for the two variables separately:
da %>% ggplot(aes(x= day, y = var1)) + geom_line()+ facet_wrap(~id, nrow = 2)
da %>% ggplot(aes(x= day, y = var2)) + geom_line()+ facet_wrap(~id, nrow = 2)
I get two separate plots:
But what I want is this (...I moved the plots with Paint to show you what I need):

Try pivoting to longer:
library(tidyverse)
da %>%
pivot_longer(var1:var2) %>%
ggplot(aes(x = day, y = value)) + geom_line() + facet_grid(name ~ id)

I would suggest an approach using patchwork, which lets you arrange your plots as you like. The solution by @arg0naut91 is a great way to tackle the issue, but if you want to place the plots without an extra facet for the variable, you can use the following code:
library(ggplot2)
library(tidyverse)
library(patchwork)
#Data
da = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
day = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
var1= c(3,4,2,1,2,2,2,3,4,4,5,3,2,1,2,3),
var2 = c(1,1,1,2,2,2,1,2,2,1,2,1,1,1,1,2))
#Plots
G1 <- da %>% ggplot(aes(x= day, y = var1)) + geom_line()+ facet_wrap(~id, nrow = 1)
G2 <- da %>% ggplot(aes(x= day, y = var2)) + geom_line()+ facet_wrap(~id, nrow = 1)
#Bind plots
G1/G2
wrap_plots(G1,G2,ncol = 1)

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!

I would pre-compute the cumsum values per bin per group, and then plot them with geom_col (bars whose heights are taken directly from the data).
library(dplyr)
library(tidyr)   # complete()
library(ggplot2)
mydata %>%
  mutate(x = cut(x, breaks = 1:10, labels = FALSE)) %>% # Bin x
  count(gender, x) %>%                                  # Counts per bin per gender
  mutate(x = factor(x, levels = 1:10)) %>%              # x as factor
  complete(x, gender, fill = list(n = 0)) %>%           # Fill missing bins with 0
  group_by(gender) %>%                                  # Group by gender ...
  mutate(y = cumsum(n)) %>%                             # ... and calculate cumsum
  ggplot(aes(x, y, fill = gender)) +                    # The rest is (gg)plotting
  geom_col(colour = "white") +
  facet_grid(gender ~ .)

Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I am posting a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (in the original example data they are), the cumulative plots have different final heights, which obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
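To see why the cumulative sum of density times bin width always ends at 100% regardless of group size, here is a small base R sketch that mirrors the calculation outside ggplot2. It uses `hist(plot = FALSE)` as a stand-in for `stat_bin`; the one-unit bin width and the names `breaks` and `cum_share` are assumptions made for the illustration.

```r
set.seed(123)
x <- c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender <- c(rep("Male", 100), rep("Female", 50))

# Common bin edges that cover the whole data range
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 1)

# Per group: density * bin width is each bin's share of the group,
# so its running total ends at 1 for groups of any size
cum_share <- lapply(split(x, gender), function(v) {
  h <- hist(v, breaks = breaks, plot = FALSE)
  cumsum(h$density * diff(h$breaks))
})

# Final value of each cumulative curve
final <- vapply(cum_share, function(v) v[length(v)], numeric(1))
print(final)
```

Cumulating raw counts instead would end at 100 for one group and 50 for the other, which is the difference in final heights described above.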

Answering a million years later....
I was looking for a solution to the same problem and ended up here.
Eventually I figured it out myself, so I'll drop it here in case other people ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(x = x, y = ave(after_stat(count), group, FUN = cumsum),
                     fill = gender, group = gender),
                 colour = "gray70", breaks = 1:10) +
  facet_grid(rows = "gender")

Grouping data outside limits in histogram using ggplot2

I am trying to do a histogram zoomed on part of the data. My problem is that I would like to group everything that is outside the range into a last category, "10+". Is it possible to do this using ggplot2?
Sample code:
library(ggplot2)
library(scales)  # provides percent
x <- data.frame(runif(10000, 0, 15))
ggplot(x, aes(runif.10000..0..15.)) +
  geom_histogram(aes(y = (..count..)/sum(..count..)), colour = "grey50", binwidth = 1) +
  scale_y_continuous(labels = percent) +
  coord_cartesian(xlim = c(0, 10)) +
  scale_x_continuous(breaks = 0:10)
Here is how the histogram looks now: [image: how the histogram looks now]
And here is how I would like it to look: [image: how the histogram should look]
It is probably possible to do this by nesting ifelses, but since my real problem has more cases, is there a way for ggplot to do it?

You could use forcats and dplyr to categorize the values, collapse the last levels into one, and then compute the percentages before plotting. Something like this should work:
library(forcats)
library(dplyr)
library(ggplot2)
x <- data.frame(x = runif(10000, 0, 15))
x2 <- x %>%
  mutate(x_grp = cut(x, breaks = seq(0, 15, 1))) %>%
  # levels 11 to 15 are the bins above 10; collapse them into a single "10+" level
  mutate(x_grp = fct_collapse(x_grp, `10+` = levels(x_grp)[11:15])) %>%
  group_by(x_grp) %>%
  summarize(count = n())
ggplot(x2, aes(x = x_grp, y = count / nrow(x))) +
  geom_bar(stat = "identity", colour = "grey50") +
  scale_y_continuous(labels = scales::percent)
The resulting graph looks very different from your example, but I think it's correct, since the data are drawn from a uniform distribution.
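As an alternative to nested ifelse calls, a base R sketch of the same grouping is to give `cut()` an open-ended last break, so that everything above 10 lands in one "10+" bin. The label format and the names `grp` and `share` are choices made for this example, not something from the question.

```r
set.seed(1)
x <- runif(10000, 0, 15)

# Ten unit-wide bins from 0 to 10, plus one open-ended bin for values above 10
grp <- cut(x,
           breaks = c(0:10, Inf),
           labels = c(paste0(0:9, "-", 1:10), "10+"))

# Share of observations per bin
share <- prop.table(table(grp))
print(round(share, 3))
```

The resulting shares can be put in a data frame and drawn with `geom_col()`; adding more cases only requires changing the `breaks` and `labels` vectors.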

apply jittering to outliers data in a boxplot with ggplot2

Do you have any idea how to apply jittering to just the outlier points of a boxplot? This is the code:
ggplot(data = a, aes(x = "", y = a$V8)) +
geom_boxplot(outlier.size = 0.5)+
geom_point(data=a, aes(x="", y=a$V8[54]), colour="red", size=3) +
theme_bw()+
coord_flip()
thank you!!

Add a vector to your data set to indicate which points are and are not outliers. Then set geom_boxplot to not plot any outliers and use geom_point to plot the outliers explicitly.
I will use the diamonds data set from ggplot2 to illustrate.
library(ggplot2)
library(dplyr)
diamonds2 <-
  diamonds %>%
  group_by(cut) %>%
  # flags high outliers only, measured from the median
  mutate(outlier = price > median(price) + IQR(price) * 1.5) %>%
  ungroup()
ggplot(diamonds2) +
  aes(x = cut, y = price) +
  geom_boxplot(outlier.shape = NA) + # boxplot draws no outliers
  # filter() replaces the deprecated filter_(); only the flagged rows are jittered
  geom_point(data = function(d) dplyr::filter(d, outlier), position = "jitter")

This is a slightly different approach from the one above (it assigns a color variable with NA for non-outliers), and it includes a correction for the upper and lower bounds calculations.
The default "outlier" definition is a point more than 1.5 x the interquartile range (IQR) below the 25th percentile or above the 75th percentile.
Generate some sample data:
library(tidyverse)
set.seed(1)
# tibble() replaces the deprecated data_frame()
a <- tibble(x = factor(rep(1:4, each = 1000)),
            V8 = c(rnorm(1000, 25, 4),
                   rnorm(1000, 50, 4),
                   rnorm(1000, 75, 4),
                   rnorm(1000, 100, 4)))
Calculate the upper/lower limit outliers (uses dplyr/tidyverse functions):
a <- a %>% group_by(x) %>%
  mutate(outlier.high = V8 > quantile(V8, .75) + 1.50*IQR(V8),
         outlier.low = V8 < quantile(V8, .25) - 1.50*IQR(V8))
Define a color for the upper/lower points:
a <- a %>% mutate(outlier.color = case_when(outlier.high ~ "red",
outlier.low ~ "steelblue"))
The unclassified cases get NA as their color and will not appear in the plot.
When this answer was written, dplyr::case_when() was not completely stable (it could require the GitHub development version > 0.5), so here is a base R alternative if that does not work:
a$outlier.color <- NA
a$outlier.color[a$outlier.high] <- "red"
a$outlier.color[a$outlier.low] <- "steelblue"
Plot:
a %>% ggplot(aes(x, V8)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(color = a$outlier.color, width = .2) + # NA not plotted
theme_bw() + coord_flip()
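The 1.5 x IQR rule used above can also be wrapped in a small helper and checked on a toy vector. This is only a sketch: `is_outlier` and its `coef` argument are hypothetical names, not part of ggplot2 or dplyr.

```r
# Flag values beyond coef * IQR below the 25th or above the 75th percentile
is_outlier <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  v < q[1] - coef * iqr | v > q[2] + coef * iqr
}

v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
flags <- is_outlier(v)
print(v[flags])  # only the value 30 lies outside the fences (-3.5, 14.5)
```

Inside a grouped `mutate()`, a single call such as `outlier = is_outlier(V8)` would replace the separate high and low columns when the direction of the outlier does not matter.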

How to do an association plot in ggplot2?

I have a table with two categorical values and I want to visualise their association; the number of times that they are found together in the same row.
For instance, let's take this data frame:
d <-data.frame(cbind(sample(1:5,100,replace=T), sample(1:10,100,replace=T)))
How can I generate a heatmap like this:
Where the colour of each square represents the number of times that X1 and X2 are found in a given combination.
It would be even better to know how to plot this as a dot plot instead, where the size of each dot represents the number of times the combination of X1 and X2 occurs.
If you can guide me on how to do this in ggplot2 or any other way in R, it would be really helpful.
Thanks!

Here's how I would do it:
library(ggplot2)
library(dplyr)
set.seed(123)
d <-data.frame(x = sample(1:5,100,replace=T), y = sample(1:10,100,replace=T))
d_sum <- d %>%
group_by(x, y) %>%
summarise(count = n())
For the heatmap:
ggplot(d_sum, aes(x, y)) +
geom_tile(aes(fill = count))
For the dotplot:
ggplot(d_sum, aes(x, y)) +
geom_point(aes(size = count))
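The counting step can also be sketched in base R with a contingency table, which may be handy if dplyr is not available. One difference to be aware of: combinations that never occur appear here with a count of 0, whereas the `group_by()`/`summarise()` version omits them. The names `tab` and `d_sum_base` are illustrative.

```r
set.seed(123)
d <- data.frame(x = sample(1:5, 100, replace = TRUE),
                y = sample(1:10, 100, replace = TRUE))

# Cross-tabulate the two columns: one cell per (x, y) combination
tab <- table(x = d$x, y = d$y)

# Long format with one row per combination and its count
d_sum_base <- as.data.frame(tab, responseName = "count")
print(head(d_sum_base))
```

Note that `x` and `y` come back as factors, so they would need converting with `as.numeric(as.character(...))` if a continuous axis is wanted.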

library(ggplot2)
library(dplyr)
set.seed(123)
d <- data.frame(x = sample(1:20, 1000, replace = TRUE), y = sample(1:20, 1000, replace = TRUE))
d %>% count(x, y) %>% ggplot(aes(x, y, fill = n)) +
  geom_tile() +
  scale_x_continuous(breaks = 1:20) +
  scale_y_continuous(breaks = 1:20) +
  scale_fill_gradient2(low = 'white', mid = 'steelblue', high = 'red') +
  guides(fill = guide_legend("Count")) +
  theme_bw() + # complete theme first, so it does not reset the tweak below
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
