Create a table with values from ecdf graph - r

I am trying to create a table using values from an ecdf plot. I've recreated an example below.
#Packages used below
library(dplyr)
library(ggplot2)
#Data
data(mtcars)
#Sort by mpg
mtcars <- mtcars[order(mtcars$mpg),]
#Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
#Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank/max(mtcars$Rank))
#Make cyl categorical
mtcars$cyl<-cut(mtcars$cyl, c(3,5,7,9), right=FALSE, labels=c(4,6,8))
#Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
Which creates this plot
I want to create a table showing the ECDF value of each cylinder type when the overall Percent_Picked is at 25%, 50%, and 75%. So something that shows that 4-cylinder is at 0%, 6 is around 28%, and 8 is around 85%.
Calculating quantiles by group doesn't give me what I want (it shows the percent of all cylinders picked when 25%, 50%, and 75% of the particular cylinder type was picked). (For example, the suggestions by tbradley1013 on their blog only help with quantiles for each particular cylinder, not the overall cdf for each cylinder at given quantiles for Percent_Picked.)
Any leads would be appreciated!

So looking around I found this question. Yours extends this a little by asking for group-specific ecdf values, so we can use the do function in dplyr (here's an example) to do so. There are some slight differences between the values in this table and the values in your ggplot, and I'm not exactly sure why that is. It could just be that the mtcars data set is somewhat small, so if you run this on a larger data set, I'd expect it to be closer to the actual values.
#Sort by mpg
mtcars <- mtcars[order(mtcars$mpg),]
#Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
#Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank/max(mtcars$Rank))
#Make cyl categorical
mtcars$cyl<-cut(mtcars$cyl, c(3,5,7,9), right=FALSE, labels=c(4,6,8))
#Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
create_ecdf_vals <- function(vec){
df <- data.frame(
x = unique(vec),
y = ecdf(vec)(unique(vec))*length(vec)
) %>%
mutate(y = scale(y, center = min(y), scale = diff(range(y)))) %>%
union_all(data.frame(x=c(0,1),
y=c(0,1))) # adding in max/mins
return(df)
}
mt.ecdf <- mtcars %>%
group_by(cyl) %>%
do(create_ecdf_vals(.$Percent_Picked))
mt.ecdf %>%
summarise(q25 = y[which.max(x[x<=0.25])],
q50 = y[which.max(x[x<=0.5])],
q75 = y[which.max(x[x<=0.75])])
ggplot(mt.ecdf,aes(x,y,color = cyl)) +
geom_step()
~EDIT~
After some digging around in the ggplot2 docs, we can actually explicitly pull out the data from the plot using the layer_data function.
my.plt <- ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
stat_ecdf(size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent)
plt.data <- layer_data(my.plt) # magic happens here
# and here's the table you want
plt.data %>%
group_by(group) %>%
summarise(q25 = y[which.max(x[x<=0.25])],
q50 = y[which.max(x[x<=0.5])],
q75 = y[which.max(x[x<=0.75])])

A much shorter answer that I can't believe I didn't see earlier. Essentially, for each cyl, I just divide the number of rows with Percent_Picked less than or equal to .25, .5, and .75 by the total number of rows.
cyl.table<-mtcars %>%
group_by(cyl) %>%
summarise("25% Picked" = sum(Percent_Picked<=0.25)/(sum(Percent_Picked<=1)),
"50% Picked" = sum(Percent_Picked<=0.5)/(sum(Percent_Picked<=1)),
"75% Picked" = sum(Percent_Picked<=0.75)/(sum(Percent_Picked<=1)))
cyl.table
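The same table can be cross-checked with base R's ecdf(), which returns a step function giving the share of values at or below a cutoff. This is only a sketch: it rebuilds Percent_Picked with match() in place of dense_rank(), and the names cars, cutoffs and ecdf_table are made up for illustration.

```r
# Rebuild Percent_Picked as in the question, using base R only
cars <- mtcars[order(mtcars$mpg), ]
rank <- match(cars$mpg, sort(unique(cars$mpg)))  # dense rank of mpg
cars$Percent_Picked <- rank / max(rank)

# Evaluate each cylinder group's ECDF at the overall cutoffs
cutoffs <- c(0.25, 0.50, 0.75)
ecdf_table <- t(sapply(split(cars$Percent_Picked, cars$cyl),
                       function(p) ecdf(p)(cutoffs)))
colnames(ecdf_table) <- paste0(cutoffs * 100, "% Picked")
ecdf_table  # one row per cyl, one column per cutoff
```

Each cell is the fraction of that cylinder group's cars whose Percent_Picked is at or below the cutoff, which is the same quantity as the sum(...)/sum(...) ratio above.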

Related

Filling bar colours with the mean of another continuous variable in ggplot2 histograms

I have a dataset at the municipality level. I would like to draw a histogram of a given variable and, at the same time, fill the bars with another continuous variable (using a color gradient). This is because I believe the municipalities with low values of the variable I am plotting the histogram for have very different population size (on average) when comparing with the municipalities that are in the upper end of the distribution.
Using the mtcar data, say I would like to plot the distribution of mpg and fill the bars with a continuous color to represent the mean of the variable wt for each of the histogram bars. I typed the code below but I don't know how to actually make the fill option take the average of wt. I would want a legend to show up with a color gradient so as to inform if the mean value of wt for each histogram bar is low-medium-high in relative terms.
mtcars %>%
ggplot(aes(x=mpg, fill=wt)) +
geom_histogram()
If you want a genuine histogram you need to transform your data to do this by summarizing it first, and plot with geom_col rather than geom_histogram. The base R function hist will help you here to generate the breaks and midpoints:
library(ggplot2)
library(dplyr)
mtcars %>%
mutate(mpg = cut(x = mpg,
breaks = hist(mpg, breaks = 0:4 * 10, plot = FALSE)$breaks,
labels = hist(mpg, breaks = 0:4 * 10, plot = FALSE)$mids)) %>%
group_by(mpg) %>%
summarize(n = n(), wt = mean(wt)) %>%
ggplot(aes(x = as.numeric(as.character(mpg)), y = n, fill = wt)) +
scale_x_continuous(limits = c(0, 40), name = "mpg") +
geom_col(width = 10) +
theme_bw()
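The summarising step can be inspected on its own before plotting. A minimal base R sketch, assuming the same breaks of 0, 10, 20, 30, 40; the names h, bin and bin_summary are illustrative only.

```r
# Bin mpg with the breaks that hist() reports, then summarise wt per bin
h <- hist(mtcars$mpg, breaks = 0:4 * 10, plot = FALSE)
bin <- cut(mtcars$mpg, breaks = h$breaks, labels = h$mids)

bin_summary <- data.frame(
  mid     = h$mids,                                   # bar centres
  n       = as.vector(table(bin)),                    # bar heights
  mean_wt = as.vector(tapply(mtcars$wt, bin, mean))   # fill value (NA for an empty bin)
)
bin_summary
```

The n column should agree with h$counts, which is a quick way to confirm that cut() and hist() put every car in the same bin.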
It is not exactly a histogram, but it was the closest I could think of for your problem:
library(tidyverse)
mtcars %>%
#Create breaks for mpg, where this sequence is just an example
mutate(mpg_cut = cut(mpg,seq(10,35,5))) %>%
#Count and mean of wt by mpg_cut
group_by(mpg_cut) %>%
summarise(
n = n(),
wt = mean(wt)
) %>%
ggplot(aes(x=mpg_cut, fill=wt)) +
#Bar plot
geom_col(aes(y = n), width = 1)

Trying to filter rows by intervals and plotting number of rows obtained

Consider the column "disp" in mtcars. I am trying to divide disp into intervals so that I can count the number of observations in each interval. After doing this I want to plot the results as a ggplot geom_line
This is what I have tried:
library (tidyverse)
library (ggplot2)
a1 <- mtcars %>% arrange(desc(disp)) %>%
mutate(counts = cut_interval(disp, length = 5)) %>% group_by(counts) %>% mutate(nn = n())
a2 <- a1 %>% select(counts,nn) %>% unique()
ggplot(a2, aes(counts, nn)) +
geom_point(shape = 16, size = 1, show.legend = FALSE) +
theme_bw()
I get the intervals I need in a2. I can use it to plot a scatterplot, but I can see that there is no proper scale. Is there any way to use these intervals to get a continuous scale and draw a line plot of counts vs nn?
mtcars %>% ggplot(aes(x = disp)) + geom_histogram(binwidth = 1) + theme_bw()
Thanks so much Rui Barradas! I just needed a count plot so no need of doing extra stuff.
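If a line plot over a continuous axis is still wanted, one option is to place each count at the midpoint of its interval. A sketch in base R, assuming bins of width 50 (an arbitrary choice, as are the names breaks, bins and bin_counts):

```r
# Count disp values per interval and keep a numeric midpoint for each interval
breaks <- seq(50, 500, by = 50)
bins <- cut(mtcars$disp, breaks = breaks)

bin_counts <- data.frame(
  mid = head(breaks, -1) + diff(breaks) / 2,  # numeric centre of each interval
  n   = as.vector(table(bins))                # observations in each interval
)
bin_counts
```

Because mid is numeric rather than a factor, it can be mapped to x on a continuous scale, for example with geom_line(aes(mid, n)).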

How to set aside certain numeric values of x with ggplot?

I have a continuous scale including some values which codify different categories of missing (for example 998,999), and I want to make a plot excluding these numeric missing values.
Since the values are together, I can use xlim each time, but since it determines the domain of the plot I have to change the values for each different case.
I can think of two possibilities:
Is it possible to set non-binding limits on the x-values? I mean, if I give 990 as a maximum limit but the largest value that appears is 100, the plot should show an x-range up to approximately 100, not 990 as xlim does.
Is there an opposite of xlim, so that the range determined by the limits (or a discrete set of given values) is excluded from the x-axis?
Thanks in advance.
I think the simplest way is to exclude these values in the plot, either before or during the ggplot call.
MWE
library(tidyverse)
# Create data with overflowing data
mtcars2 <- mtcars
mtcars2[5:15, 'mpg'] <- 998
# Full plot
mtcars2 %>% ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering before plot
mtcars2 %>%
filter(mpg < 250) %>%
ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering during plot
mtcars2 %>%
ggplot() +
geom_point(aes(x = mpg, y = disp), data = . %>% filter(mpg < 250))
I would filter those missing values from the original dataset:
library(dplyr)
df <- data.frame(cat = rep(LETTERS[1:4], 3),
values = sample(10, 12, replace = TRUE)
)
# Add missing values
df$values[c(1,5,10)] <- 999
df$values[c(2,7)] <- 998
invalid_values <- c(998, 999)
library(ggplot2)
df %>%
filter(!values %in% invalid_values) %>%
ggplot() +
geom_point(aes(cat, values))
Alternatively, if that's not possible for some reason, you can define a scale transformation:
df %>%
ggplot() +
geom_point(aes(cat, values)) +
scale_y_continuous(trans = scales::trans_new('remove_invalid',
transform = function(d) {d <- if_else(d %in% invalid_values, NA_real_, d)},
inverse = function(d) {if_else(is.na(d), 999, d)}
)
)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 5 rows containing missing values (geom_point).
Created on 2018-05-09 by the reprex package (v0.2.0).
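Another way to get the same effect is to recode the sentinel codes to NA once, up front, so that every later summary or plot ignores them. A small sketch with a made-up vector x; the codes 998 and 999 are taken from the question.

```r
# Recode the missing-value codes to NA so they no longer stretch the axis range
invalid_values <- c(998, 999)
x <- c(3, 7, 998, 5, 999, 10)

x_clean <- replace(x, x %in% invalid_values, NA)
x_clean                       # sentinel codes are now NA
range(x_clean, na.rm = TRUE)  # range of the real values only
```

ggplot2 drops rows with missing values (with a warning), so the axis then follows the real data without any per-plot xlim.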

Label only the outliers in plotly boxplot [duplicate]

I have code that creates a boxplot using ggplot in R, and I want to label my outliers with the year and Battle.
Here is my code to create my boxplot
require(ggplot2)
ggplot(seabattle, aes(x=PortugesOutcome,y=RatioPort2Dutch ),xlim="OutCome",
y="Ratio of Portuguese to Dutch/British ships") +
geom_boxplot(outlier.size=2,outlier.colour="green") +
stat_summary(fun.y="mean", geom = "point", shape=23, size =3, fill="pink") +
ggtitle("Portugese Sea Battles")
Can anyone help? I know this code works; I just want to label the outliers.
The following is a reproducible solution that uses dplyr and the built-in mtcars dataset.
Walking through the code: First, create a function, is_outlier that will return a boolean TRUE/FALSE if the value passed to it is an outlier. We then perform the "analysis/checking" and plot the data -- first we group_by our variable (cyl in this example, in your example, this would be PortugesOutcome) and we add a variable outlier in the call to mutate (if the drat variable is an outlier [note this corresponds to RatioPort2Dutch in your example], we will pass the drat value, otherwise we will return NA so that value is not plotted). Finally, we plot the results and plot the text values via geom_text and an aesthetic label equal to our new variable; in addition, we offset the text (slide it a bit to the right) with hjust so that we can see the values next to, rather than on top of, the outlier points.
library(dplyr)
library(ggplot2)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
mtcars %>%
group_by(cyl) %>%
mutate(outlier = ifelse(is_outlier(drat), drat, as.numeric(NA))) %>%
ggplot(., aes(x = factor(cyl), y = drat)) +
geom_boxplot() +
geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3)
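To see which rows the grouped check flags without drawing anything, the same rule can be applied per group in base R. This is a sketch only: ave() stands in for group_by() + mutate(), and the name flag is illustrative.

```r
# Same 1.5 * IQR rule as above, written with unnamed quantiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)
}

# ave() applies the function within each cyl group and keeps the row order;
# it returns 0/1, so convert back to logical
flag <- as.logical(ave(mtcars$drat, mtcars$cyl, FUN = is_outlier))

# Rows that would receive a label in the plot
mtcars[flag, c("cyl", "drat")]
```

Grouping matters here: a drat value can be an outlier within its own cyl group without being one across the whole data set.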
You can do this simply within ggplot itself, using an appropriate stat_summary call.
ggplot(mtcars, aes(x = factor(cyl), y = drat, fill = factor(cyl))) +
geom_boxplot() +
stat_summary(
# fun and after_stat() need ggplot2 >= 3.3.0; older versions use fun.y and stat()
aes(label = round(after_stat(y), 1)),
geom = "text",
fun = function(y) { o <- boxplot.stats(y)$out; if(length(o) == 0) NA else o },
hjust = -1
)
To label the outliers with rownames (based on JasonAizkalns's answer):
library(dplyr)
library(ggplot2)
library(tibble)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
dat <- mtcars %>% tibble::rownames_to_column(var="outlier") %>% group_by(cyl) %>% mutate(is_outlier=ifelse(is_outlier(drat), drat, as.numeric(NA)))
dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
ggplot(dat, aes(y=drat, x=factor(cyl))) + geom_boxplot() + geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05)
Does this work for you?
library(ggplot2)
library(data.table)
#generate some data
set.seed(123)
n=500
dat <- data.table(group=c("A","B"),value=rnorm(n))
ggplot defines an outlier by default as something that's > 1.5*IQR from the borders of the box.
#function that takes in vector of data and a coefficient,
#returns boolean vector if a certain point is an outlier or not
check_outlier <- function(v, coef=1.5){
quantiles <- quantile(v,probs=c(0.25,0.75))
IQR <- quantiles[2]-quantiles[1]
res <- v < (quantiles[1]-coef*IQR)|v > (quantiles[2]+coef*IQR)
return(res)
}
#apply this to our data
dat[,outlier:=check_outlier(value),by=group]
dat[,label:=ifelse(outlier,"label","")]
#plot
ggplot(dat,aes(x=group,y=value))+geom_boxplot()+geom_text(aes(label=label),hjust=-0.3)
Similar answer to above, but gets outliers directly from ggplot2, thus avoiding any potential conflict in method:
# calculate boxplot object
g <- ggplot(mtcars, aes(factor(cyl), drat)) + geom_boxplot()
# get list of outliers
out <- ggplot_build(g)[["data"]][[1]][["outliers"]]
# label list elements with factor levels
names(out) <- levels(factor(mtcars$cyl))
# convert to tidy data
tidyout <- purrr::map_df(out, tibble::as_tibble, .id = "cyl")
# plot boxplots with labels
g + geom_text(data = tidyout, aes(cyl, value, label = value),
hjust = -.3)
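One place such a conflict can come from: boxplot.stats() builds its box from Tukey's hinges (fivenum()), while quantile() with its default settings interpolates differently, so the two sets of fences need not agree on small samples. A toy illustration with a made-up vector x follows. Which of the two a given ggplot2 version uses for its box is an assumption to check, which is why reading the outliers back from the built plot is the safer route.

```r
# Tukey's hinges versus default quantiles on a tiny sample
x <- c(1, 2, 3, 4)

hinges    <- fivenum(x)[c(2, 4)]                         # lower and upper hinge
quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)   # default (type 7) quartiles

hinges      # 1.5 3.5
quartiles   # 1.75 3.25
```

With different box edges, the 1.5 * IQR fences shift as well, so a borderline point can be an outlier under one rule and not under the other.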
With a small twist on @JasonAizkalns's solution you can label outliers with their location in your data frame.
mtcars[,'row'] <- row(mtcars)[,1]
...
mutate(outlier = ifelse(is_outlier(drat), row, as.numeric(NA)))
...
I load the data frame into the RStudio environment so I can then take a closer look at the data in the outlier rows.

