apply jittering to outliers data in a boxplot with ggplot2

Do you have any idea how to apply jittering to just the outliers of a boxplot? This is the code:
ggplot(data = a, aes(x = "", y = a$V8)) +
geom_boxplot(outlier.size = 0.5)+
geom_point(data=a, aes(x="", y=a$V8[54]), colour="red", size=3) +
theme_bw()+
coord_flip()
thank you!!

Add a column to your data set to indicate which points are and are not outliers. Then set geom_boxplot to not plot any outliers and use geom_point to plot the outliers explicitly.
I will use the diamonds data set from ggplot2 to illustrate.
library(ggplot2)
library(dplyr)
diamonds2 <-
diamonds %>%
group_by(cut) %>%
mutate(outlier = price > median(price) + IQR(price) * 1.5) %>%
ungroup
ggplot(diamonds2) +
aes(x = cut, y = price) +
geom_boxplot(outlier.shape = NA) + # NO OUTLIERS
geom_point(data = function(x) dplyr::filter(x, outlier), position = 'jitter') # Outliers only (filter_() is deprecated)

This is a slightly different approach from the one above (it assigns a color variable that is NA for non-outliers), and it corrects the calculation of the upper and lower bounds.
The default "outlier" definition is a point more than 1.5 x the interquartile range (IQR) below the 25th percentile or above the 75th percentile.
Generate some sample data:
set.seed(1)
a <- tibble::tibble(x = factor(rep(1:4, each = 1000)), # data_frame() is deprecated
V8 = c(rnorm(1000, 25, 4),
rnorm(1000, 50, 4),
rnorm(1000, 75, 4),
rnorm(1000, 100, 4)))
Calculate the upper/lower outlier limits (uses dplyr/tidyverse functions):
library(tidyverse)
a <- a %>% group_by(x) %>%
mutate(outlier.high = V8 > quantile(V8, .75) + 1.50*IQR(V8),
outlier.low = V8 < quantile(V8, .25) - 1.50*IQR(V8))
Define a color for the upper/lower points:
a <- a %>% mutate(outlier.color = case_when(outlier.high ~ "red",
outlier.low ~ "steelblue"))
The unclassified cases will be coded as NA for color, and will not appear in the plot.
At the time of writing, dplyr::case_when() was not completely stable (it may have required the GitHub development version, > 0.5), so here is a base R alternative if it does not work:
a$outlier.color <- NA
a$outlier.color[a$outlier.high] <- "red"
a$outlier.color[a$outlier.low] <- "steelblue"
Plot:
a %>% ggplot(aes(x, V8)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(color = a$outlier.color, width = .2) + # NA not plotted
theme_bw() + coord_flip()
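The quartile rule used above can also be wrapped in a small helper, which makes it easy to check which values would be flagged before plotting. This is only a sketch in base R: the name is_outlier and its coef argument are made up for illustration and are not part of ggplot2 or dplyr.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences.
# Quartiles come from stats::quantile() with its default settings.
is_outlier <- function(v, coef = 1.5) {
  q <- stats::quantile(v, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  v < q[[1]] - coef * iqr | v > q[[2]] + coef * iqr
}

# Toy vector: only the last value lies beyond the upper fence
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
is_outlier(v)
```

With grouped data, the helper could be called inside mutate() after group_by(x), for example mutate(outlier = is_outlier(V8)).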

Related

How to remove low frequency bins in histogram

Let's say I have a data frame containing a vector of numbers that I want to visualise in a histogram. What I want to achieve is to show only the bins containing more than, say, 50 observations.
Step 1
set.seed(10)
x <- data.frame(x = rnorm(1000, 50, 2))
p <-
x %>%
ggplot(., aes(x)) +
geom_histogram()
p
Step 2
pg <- ggplot_build(p)
pg$data[[1]]
As a check, when I print pg$data[[1]] I'd like to see only rows where count >= 50.
Thank you

library(ggplot2)
ggplot(x, aes(x=x, y = ifelse(..count.. > 50, ..count.., 0))) +
geom_histogram(bins=30)
With this code you can see the counts of the deleted bins:
library(ggplot2)
ggplot(x, aes(x=x, y = ifelse(..count.. > 50, ..count.., 0))) +
geom_histogram(bins=30, fill="green", color="grey") +
stat_bin(aes(label=..count..), geom="text", vjust = -0.7)

You could do something like this. Most likely you will not like the factor labels on the x-axis, but you could split each label into its two bounds and use their average as the x position instead.
x %>%
mutate(bin = cut(x, breaks = 30)) %>%
group_by(bin) %>%
mutate(count = n()) %>%
filter(count > 50) %>%
ggplot(., aes(bin)) +
geom_bar() # counts rows per bin; same result as geom_histogram(stat = "count") without the warning
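Another option, in line with the idea of doing the binning before plotting, is to let base R compute the bins and then filter them. The following is a rough sketch rather than a definitive solution: it assumes that hist() binning is acceptable (its breaks will generally differ from those chosen by geom_histogram), and the names bins, kept and bin_width are made up for illustration.

```r
set.seed(10)
x <- rnorm(1000, 50, 2)

# Bin the data without plotting; breaks = 30 is only a suggestion to hist()
h <- hist(x, breaks = 30, plot = FALSE)
bins <- data.frame(mid = h$mids, count = h$counts)
bin_width <- h$breaks[2] - h$breaks[1]

# Keep only the bins with more than 50 observations
kept <- bins[bins$count > 50, ]
kept

# Optional plot, built only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(kept, ggplot2::aes(mid, count)) +
    ggplot2::geom_col(width = bin_width)
}
```

Because the x-axis stays numeric here, the bars keep their real positions and there are no factor labels to clean up.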

Normal curves on multiple histograms on a same plot

My example dataframe:
sample1 <- seq(100,157, length.out = 50)
sample2 <- seq(113, 167, length.out = 50)
sample3 <- seq(95,160, length.out = 50)
sample4 <-seq(88, 110, length.out = 50)
df <- as.data.frame(cbind(sample1, sample2, sample3, sample4))
I have managed to create histograms for these four variables, which share the same y-axis. Now I need an overlay normal curve. Based on previous posts, I've managed a density curve, but this is not what I want. This comes close, but I'd like a smooth line...
This is my current code for plotting:
df <- as.data.table(df)
new.df<-melt(df,id.vars="sample")
names(new.df)=c("sample","type","value")
cdat <- ddply(new.df, "type", summarise, value.mean=mean(value))
ggplot(data = new.df,aes(x=value)) +
geom_histogram(aes(x = value), bins = 15, colour = "black", fill = "gray") +
facet_wrap(~ type) + geom_density(aes(x = value),alpha=.2, fill="#FF6666") +
geom_vline(data=cdat, aes(xintercept=value.mean),
linetype="dashed", size=1, colour="black") +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),aspect.ratio = 0.75 )
And I found the following code, which I hoped would do the trick, but this gives me nothing:
stat_function(fun = dnorm, args = list(mean = mean(df$value), sd = sd(df$value)))

Unfortunately, stat_function doesn't play nicely with facets: it overlays the same function on each facet without taking account of the faceting variable.
One of the most common reasons I see for people posting ggplot questions on Stack Overflow is that they get lost while trying to coerce ggplot to do too much of their data manipulation. Functions like geom_smooth and geom_function are useful helpers for common tasks, but if you want to do something that is complex or uncommon, it is best to produce the data you want to plot, then plot it.
In fact, the main author of ggplot2 recommends this approach for a very similar problem to yours in this thread, saying:
I think you are better off generating the data outside of ggplot2 and then plotting it. See https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse to get started.
Hadley Wickham, 26 April 2018
So here's one way of doing that using tidyverse. You create a data frame of the dnorm for each sample and plot these using plain old geom_line.
Note that your histograms are counts, so you either need to change them to density, or multiply the dnorm output by the number of observations * the binwidth, otherwise you will just get an apparently "flat" line on the x axis, since the dnorm values will all be so small in relation to the counts:
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
dfn <- df %>%
pivot_longer(everything()) %>%
ddply("name", function(x) {
xvar <- seq(min(x$value), max(x$value), length.out = 100)
data.frame(value = xvar,
# scale the density to counts: binwidth (5) * number of observations
y = 5 * nrow(x) * dnorm(xvar, mean(x$value), sd(x$value)))
})
df %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(mean = mean(value), sd = sd(value)) %>%
ggplot(aes(value)) +
geom_histogram(aes(x = value), binwidth = 5,
colour = "black", fill = "gray") +
facet_wrap(~ name) +
geom_vline(aes(xintercept = mean),
linetype = "dashed", size=1, colour="black") +
geom_line(data = dfn, aes(y = y)) +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),
aspect.ratio = 0.75 )
Created on 2020-12-07 by the reprex package (v0.3.0)
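To see why the scaling matters, here is a small base R sketch of the relationship between density and expected bin counts (count is roughly n * binwidth * density). The sample v and the variable names are made up for illustration.

```r
set.seed(1)
v <- rnorm(500, mean = 100, sd = 15)  # hypothetical sample
binwidth <- 5
n <- length(v)

grid <- seq(min(v), max(v), length.out = 100)
dens <- dnorm(grid, mean(v), sd(v))

# Expected number of observations per bin of this width
expected_count <- n * binwidth * dens

range(dens)            # tiny values: invisible next to histogram counts
range(expected_count)  # on the same scale as the histogram bars
```

The unscaled density never exceeds a few hundredths here, which is why it looks like a flat line when drawn on a count axis.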

Why do I not get two legends using ggplot2?

I am plotting different models' prediction lines over some data points. I would like one legend showing which individual each point colour belongs to, and another legend showing which model each line colour belongs to. Below I share a fake example for reproducibility:
set.seed(123)
df <- data.frame(Height =rnorm(500, mean=175, sd=15),
Weight =rnorm(500, mean=70, sd=20),
ID = rep(c("A","B","C","D"), (500/4)))
mod1 <- lmer(Height ~ Weight + (1|ID), df)
mod2 <- lmer(Height ~ poly(Weight,2) + (1|ID), df)
y.mod1 <- predict(mod1, data.frame(Weight=df$Weight),re.form=NA) # Prediction of y according to model 1
y.mod2 <- predict(mod2, data.frame(Weight=df$Weight),re.form=NA) # Prediction of y according to model 2
df <- cbind(df, y.mod1,y.mod2)
df <- as.data.frame(df)
head(df)
Height Weight ID y.mod1 y.mod2
1 166.5929 57.96214 A 175.9819 175.4918
2 171.5473 50.12603 B 176.2844 176.3003
3 198.3806 90.53570 C 174.7241 174.7082
4 176.0576 85.02123 D 174.9371 174.5487
5 176.9393 39.81667 A 176.6825 177.7303
6 200.7260 68.09705 B 175.5905 174.8027
First I plot my data points:
Plot_a <- ggplot(df,aes(x=Weight, y=Height,colour=ID)) +
geom_point() +
theme_bw() +
guides(color=guide_legend(override.aes=list(fill=NA)))
Plot_a
Then, I add lines relative to the prediction models:
Plot_b <- Plot_a +
geom_line(data = df, aes(x=Weight, y=y.mod1,color='mod1'),show.legend = T) +
geom_line(data = df, aes(x=Weight, y=y.mod2,color='mod2'),show.legend = T) +
guides(fill = guide_legend(override.aes = list(linetype = 0)),
color=guide_legend(title=c("Model")))
Plot_b
Does anyone know why I am not getting two different legends, one titled Model and the other ID?
I would like to get this

This type of problem generally has to do with reshaping the data. The data should be in long format, but it is in wide format. See this post on how to reshape the data from wide to long format.
The plot layers become simpler: one geom_line is enough, and there is no need for guides to override the aesthetics.
To customize the models' legend text, create a vector of labels, in this case with plotmath, in order to have math notation. The colors are set manually too.
library(dplyr)
library(tidyr)
library(ggplot2)
model_labels <- c(expression(X^1), expression(X^2))
df %>%
pivot_longer(
cols = c(y.mod1, y.mod2),
names_to = "Model",
values_to = "Value"
) %>%
ggplot(aes(Weight, Height)) +
geom_point(aes(fill = ID), shape = 21) +
geom_line(aes(y = Value, color = Model)) +
scale_color_manual(labels = model_labels,
values = c("coral", "coral4")) +
theme_bw()

The issue is that in ggplot2 each aesthetic can have only one scale and one legend. As you are using only the color aesthetic, you get one legend. If you want multiple legends for the same aesthetic, have a look at the ggnewscale package. Otherwise you have to make use of a second aesthetic.
My preferred approach would be similar to the one proposed by @RuiBarradas. However, to stick close to your approach, this could be achieved like so:
Map on linetype instead of color in your calls to geom_line.
Set the colors for the lines as arguments, i.e. not inside aes.
Use scale_linetype_manual to get solid lines for both models.
Use guide_legend to fix the colors appearing in the legend.
library(ggplot2)
library(lme4)
#> Loading required package: Matrix
set.seed(123)
df <- data.frame(Height =rnorm(500, mean=175, sd=15),
Weight =rnorm(500, mean=70, sd=20),
ID = rep(c("A","B","C","D"), (500/4)))
mod1 <- lmer(Height ~ Weight + (1|ID), df)
mod2 <- lmer(Height ~ poly(Weight,2) + (1|ID), df)
y.mod1 <- predict(mod1, data.frame(Weight=df$Weight),re.form=NA) # Prediction of y according to model 1
y.mod2 <- predict(mod2, data.frame(Weight=df$Weight),re.form=NA) # Prediction of y according to model 2
df <- cbind(df, y.mod1,y.mod2)
df <- as.data.frame(df)
Plot_a <- ggplot(df) +
geom_point(aes(x=Weight, y=Height, colour=ID)) +
theme_bw() +
guides(color=guide_legend(override.aes=list(fill=NA)))
line_colors <- scales::hue_pal()(2)
Plot_b <- Plot_a +
geom_line(aes(x=Weight, y=y.mod1, linetype = "mod1"), color = line_colors[1]) +
geom_line(aes(x=Weight, y=y.mod2, linetype = "mod2"), color = line_colors[2]) +
scale_linetype_manual(values = c(mod1 = "solid", mod2 = "solid")) +
labs(color = "ID", linetype = "Model") +
guides(linetype = guide_legend(override.aes = list(color = line_colors)))
Plot_b
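For completeness, the ggnewscale route mentioned above might look roughly like the sketch below. It is an illustration under assumptions, not the approach used in this answer: the toy data frame and its columns are made up, lm() stands in for lmer() to keep the example small, and the plot is only built when both ggplot2 and ggnewscale are installed.

```r
set.seed(123)
# Hypothetical toy data, much smaller than the data in the question
toy <- data.frame(Weight = runif(40, 40, 100),
                  ID = rep(c("A", "B"), 20))
toy$Height <- 150 + 0.3 * toy$Weight + rnorm(40, sd = 5)
toy$mod1 <- fitted(lm(Height ~ Weight, data = toy))
toy$mod2 <- fitted(lm(Height ~ poly(Weight, 2), data = toy))

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggnewscale", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(Weight, Height)) +
    geom_point(aes(colour = ID)) +
    scale_colour_discrete(name = "ID") +
    # Layers added after this call get their own colour scale and legend
    ggnewscale::new_scale_colour() +
    geom_line(aes(y = mod1, colour = "mod1")) +
    geom_line(aes(y = mod2, colour = "mod2")) +
    scale_colour_manual(name = "Model",
                        values = c(mod1 = "coral", mod2 = "coral4"))
}
```

The order matters in this sketch: the ID scale is declared before new_scale_colour(), and the Model scale after it.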

ggplot2 - a custom histogram with a rug plot

I am trying to create a custom histogram with a rug plot showing the original values on the X axis.
I am going to use the mtcars dataset to illustrate. It's not the best dataset for this question, but hopefully the reader will understand what I am trying to achieve.
Below shows the basic histogram, without any rug plot attempt.
I want to create the histogram using geom_bar as this allows for more flexibility with custom bins.
I also want a small gap between the histogram bars (i.e. width = 0.95), which adds to this problem's complexity.
library(dplyr)
library(ggplot2)
# create custom bins
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)
mtcars$bin <- cut(mtcars$mpg, breaks = vct_seq)
# create data.frame for the ggplot graph..using bins above
df_mtcars_count <- mtcars %>% group_by(bin) %>% summarise(count = n())
# indicative labels
vct_labels <- c("bin 1", "bin 2", "bin 3", "bin 4")
# attempt 1 - basic plot -- no rug plot
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p
Next, try and add a basic rug plot on the X axis. This obviously doesn't work as the geom_bar and geom_rug have completely different scales.
# attempt 2 with no scaling.... doesn't work as x scale for ordinal (bins) and
# x scale for continuous (mpg) do not match
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg), inherit.aes = F, alpha = 0.3)
p
Now, try and rescale the mpg column to match with the ordinal scale....
First define a linear mapping function...
fn_linear_map <- function(vct_existing_val, vct_new_range) {
# example....converts 1:20 into the range 1 to 10 like this:
# fn_linear_map(1:20, c(1, 10))
fn_r_diff <- function(x) x %>% range() %>% diff()
flt_ratio <- fn_r_diff(vct_new_range) / fn_r_diff(vct_existing_val)
vct_old_min_offset <- vct_existing_val - min(vct_existing_val)
vct_new_range_val <- (vct_old_min_offset * flt_ratio) + min(vct_new_range)
return(vct_new_range_val)
}
Now apply the function. We try to map mpg to the range 1 to 4 (an attempt to match the ordinal scale):
mtcars$mpg_remap <- fn_linear_map(mtcars$mpg, c(1, 4))
Try the plot again.... getting closer ... but not really accurate...
# attempt 3: getting closer but doesn't really match the ordinal scale
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg_remap), inherit.aes = F, alpha = 0.3)
p
The graph above is getting close to what I want, but the rug plot does not line up with the actual data. For example, the max observation (33.9) should be displayed almost aligned with the right-hand side of the last bar. See below:
mtcars %>% filter(bin == "(25,34]") %>% arrange(mpg) %>% dplyr::select(mpg, mpg_remap)

Your scale makes no sense to me, as you are showing a bin that is twice as wide using the same bar width. Doing that in combination with a rug strikes me as confusing at best and misleading at worst. I suggest you plot the bars with their correct widths, after which the rug is trivial.
I think the best solution is to just use geom_histogram:
ggplot(mtcars, aes(mpg)) +
geom_histogram(breaks = vct_seq, col = 'grey80') +
geom_rug(aes(mpg, y = NULL))
If you really want the gaps between the bars you'll have to do more work:
library(tidyr)
d <- mtcars %>%
count(bin) %>%
separate(bin, c('min', 'max'), sep = ',', remove = FALSE) %>%
mutate_at(vars('min', 'max'), readr::parse_number) %>%
mutate(
middle = min + (max - min) / 2,
width = 0.9 * (max - min)
)
ggplot(d, aes(middle, n)) +
geom_col(width = d$width) +
geom_rug(aes(mpg, y = NULL), mtcars)
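As a variation on the above, the bar centres and widths can be derived directly from the break vector instead of being parsed back out of the factor labels. This is a base R sketch under the same assumptions (the vct_seq breaks from the question and a 10% gap); it builds a data frame with the same columns as d.

```r
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)

# Lower and upper edge of each bin, taken straight from the breaks
d <- data.frame(
  min = head(vct_seq, -1),
  max = tail(vct_seq, -1)
)
# Count the observations that fall into each (min, max] interval
d$n      <- as.vector(table(cut(mtcars$mpg, breaks = vct_seq)))
d$middle <- d$min + (d$max - d$min) / 2
d$width  <- 0.9 * (d$max - d$min)
d
```

The resulting d can be passed to the same ggplot(d, aes(middle, n)) call shown above.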

Grouping data outside limits in histogram using ggplot2

I am trying to make a histogram zoomed in on part of the data. My problem is that I would like to group everything that is outside the range into a last category, "10+". Is it possible to do this using ggplot2?
Sample code:
x <- data.frame(runif(10000, 0, 15))
ggplot(x, aes(runif.10000..0..15.)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), colour = "grey50", binwidth = 1) +
scale_y_continuous(labels = percent) +
coord_cartesian(xlim=c(0, 10)) +
scale_x_continuous(breaks = 0:10)
Here is how the histogram looks now:
[Figure: how the histogram looks now]
And here is how I would like it to look:
[Figure: how the histogram should look]
It is probably possible to do it by nesting ifelse calls, but as my real problem has more cases, is there a way for ggplot to do it?

You could use forcats and dplyr to efficiently categorize the values, aggregate the last "levels" and then compute the percentages before the plot. Something like this should work:
library(forcats)
library(dplyr)
library(ggplot2)
x <- data.frame(x = runif(10000, 0, 15))
x2 <- x %>%
mutate(x_grp = cut(x, breaks = c(seq(0,15,1)))) %>%
mutate(x_grp = fct_collapse(x_grp, other = levels(x_grp)[10:15])) %>%
group_by(x_grp) %>%
dplyr::summarize(count = n())
ggplot(x2, aes(x = x_grp, y = count/10000)) +
geom_bar(stat = "identity", colour = "grey50") +
scale_y_continuous(labels = scales::percent) # percent() comes from the scales package
The resulting graph looks very different from your example, but I think it is correct, since the data are drawn from a uniform distribution.
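If the goal is specifically a final "10+" bucket, one possible base R sketch is to give cut() an open-ended last break. The labels and variable names below are made up for illustration, and the sketch assumes that everything above 10 should fall into the last category.

```r
set.seed(42)
x <- runif(10000, 0, 15)

# Ten unit-wide bins, then one open-ended bin for everything above 10
breaks <- c(0:10, Inf)
labels <- c(paste0(0:9, "-", 1:10), "10+")
grp <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

# Share of observations per category
share <- prop.table(table(grp))
round(share, 3)
```

The result can be turned into a data frame with as.data.frame(share) and drawn with geom_col(), which avoids nesting ifelse calls however many cases there are.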
