I have this code to create two split columns based on quantiles: one column for a median split and one column for a quartile split.
mtcars <- subset(mtcars, select = c("cyl", "disp"))
mtcars$median_split <- ifelse(mtcars$disp <= median(mtcars$disp), "below_median", "above_median")
mtcars$quantile_split <- cut(mtcars$disp,
                             breaks = c(0, quantile(mtcars$disp)),
                             labels = c("1_quartile", paste0(1:4, "_quartile")))
This works nicely for the whole dataset, but how can I do this for each cyl separately, please?
So, I am hoping to print the median/quartile split labels based on disp values within each cyl group. Thank you.
This can be accomplished using the dplyr package:
library(dplyr)
mtcars %>%
  select(cyl, disp) %>%
  group_by(cyl) %>%
  mutate(median_split = ifelse(disp <= median(disp), "below_median", "above_median"),
         quartile_split = cut(disp,
                              breaks = c(0, quantile(disp)),
                              labels = c("1_quartile", paste0(1:4, "_quartile")))) %>%
  arrange(cyl)
This code groups the data by the cyl column and then computes the median_split and quartile_split based on the disp values within each cyl group.
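If you only need to know which quartile each observation falls into, rather than the exact quantile break points, a possibly simpler variant is dplyr's ntile(), which assigns each row within a group to one of n roughly equal-sized bins. This is a sketch of an alternative, not part of the answer above; note that equal-sized bins are not exactly the same as quantile-based cut points when there are ties:
library(dplyr)
mtcars %>%
  select(cyl, disp) %>%
  group_by(cyl) %>%
  mutate(median_split = ifelse(disp <= median(disp), "below_median", "above_median"),
         # ntile() puts each row into one of 4 roughly equal-sized bins per cyl group
         quartile_split = paste0(ntile(disp, 4), "_quartile")) %>%
  arrange(cyl)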
I have multiple graphs I'm generating with a data set. I perform many operations on the data (filtering rows, aggregating rows, calculations over columns, etc.) before passing the result on to ggplot(). I want to access the data I passed to ggplot() in subsequent ggplot layers and facets, so I can have more control over the resulting graph and include some characteristics of the data in the plot itself, for example the number of observations.
Here is a reproducible example:
library(tidyverse)
cars <- mtcars
# Normal scatter plot
cars %>%
  filter(
    # Many complicated operations
  ) %>%
  group_by(
    # More complicated operations
    across()
  ) %>%
  summarise(
    # Even more complicated operations
    n = n()
  ) %>%
  ggplot(aes(x = mpg, y = qsec)) +
  geom_point() +
  # Join the dots but only if mpg < 20
  geom_line(data = .data %>% filter(mpg < 20)) +
  # Include the total number of observations in the graph
  labs(caption = paste("N. obs =", NROW(.data)))
One could of course create a separate data set before passing it on to ggplot() and then reference that data set throughout (as in the example below). However, this is much more cumbersome, as you need to save (and later remove) a data set for each graph and run two separate commands for just one graph.
I want to know if there is something that can be done that's more akin to the first example using .data (which obviously doesn't actually work).
library(tidyverse)
cars <- mtcars
tmp <- cars %>%
  filter(
    # Many complicated operations
  ) %>%
  group_by(
    # More complicated operations
    across()
  ) %>%
  summarise(
    # Even more complicated operations
    n = n()
  )

tmp %>%
  ggplot(aes(x = mpg, y = qsec)) +
  geom_point() +
  # Join the dots but only if mpg < 20
  geom_line(data = tmp %>% filter(mpg < 20)) +
  # Include the total number of observations in the graph
  labs(caption = paste("N. obs =", NROW(tmp)))
Thanks for your help!
The help page for each geom_* function helpfully documents a standard way of doing this for layer data:
A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).
For labs(), on the other hand, you can use the magrittr . placeholder, but you have to a) pass the . in as the data argument in the first place and b) wrap the whole ggplot call in curly braces so that the later . is recognised.
So for example:
library(tidyverse)
cars <- mtcars
# Normal scatter plot
cars %>%
  filter() %>%
  group_by(across()) %>%
  summarise(n = n()) %>%
  {
    ggplot(., aes(x = mpg, y = qsec)) +
      geom_point() +
      geom_line(data = ~ filter(.x, mpg < 20)) +
      labs(caption = paste("N. obs =", NROW(.)))
  }
Or, if you don't like the purrr formula syntax, the new base R anonymous function syntax (R 4.1+) works too:
geom_line(data = \(x) filter(x, mpg < 20)) +
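For completeness, here is the same pipeline as above with the formula swapped for the base R lambda; this is just a restatement of the earlier example and requires R 4.1 or later:
cars %>%
  filter() %>%
  group_by(across()) %>%
  summarise(n = n()) %>%
  {
    ggplot(., aes(x = mpg, y = qsec)) +
      geom_point() +
      # the lambda receives the layer data and must return a data frame
      geom_line(data = \(x) filter(x, mpg < 20)) +
      labs(caption = paste("N. obs =", NROW(.)))
  }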
Unfortunately, labs() doesn't have an explicit way of accessing the data being passed invisibly through the ggplot stack, since by and large it can get on with its job without touching the main data. The approaches above are some ways around this.
# Preparing the data and loading packages
library(modelsummary); library(tidyverse); library(gt)

as_tibble(mtcars)

df <- mtcars %>%
  mutate(cyl_ = factor(cyl)) %>%
  dplyr::select(cyl_, mpg, vs, am, hp, wt)

# Gets table of descriptive statistics about different subsets of the data
print(t1 <- datasummary_balance(~ cyl_,
                                data = df,
                                output = "gt"))

# This hides the "Std. Dev." columns
t1 %>% cols_hide(c(3, 5, 7))
#Now I want to hide the "Mean" column labels, but I want to keep the "cyl_" value column labels. Any ideas how?
I want something like this:
Using the gt package, you can pipe your table to tab_options(column_labels.hidden = TRUE) to remove column labels. Unfortunately, this will remove both levels: the column headers, and the spanning labels that include the cyl info you want to keep.
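For illustration, this is what that looks like applied to the table from the question (a sketch assuming the t1 object defined above; as noted, it removes the cyl_ spanning labels as well, so it does not fully solve the problem):
t1 %>%
  cols_hide(c(3, 5, 7)) %>%
  tab_options(column_labels.hidden = TRUE) # hides both levels of column labels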
Note that datasummary_balance() produces a highly customized table which is intended to be used as a ready-made output. In cases like these, it might be easier to just build the custom table you want using datasummary() instead of trying to customize datasummary_balance() (square peg, round hole, etc). For example:
library(modelsummary)
library(tidyverse)
df <- mtcars %>%
  select(cyl, mpg, vs, am, hp, wt) %>%
  mutate(cyl = factor(sprintf("%s (N = %s)", cyl, n()))) %>%
  as.data.frame() # The `All()` function does not accept tibbles

datasummary(
  All(df) ~ Mean * cyl,
  data = df,
  output = "gt")
Within a custom function, how can I avoid repeating the same code for each group while allowing an unknown number of groups?
Here's a simpler example but assume the function has tons of operations, like calculating different statistics for each group and sticking them on each ggplot facet. Sorry, I find it difficult to make a simpler function to demonstrate this specific challenge.
test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)
  data$variable <- data[, variable]
  data$group <- factor(data[, group])
  # Compute individual group stats
  data %>%
    filter(data$group == levels(data$group)[1]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.1 <- round(shap$p.value, 3)
  data %>%
    filter(data$group == levels(data$group)[2]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.2 <- round(shap$p.value, 3)
  data %>%
    filter(data$group == levels(data$group)[3]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.3 <- round(shap$p.value, 3)
  # Make the stats dataframe for ggplot
  dat_text <- data.frame(
    group = levels(data$group),
    text = c(shapiro.1, shapiro.2, shapiro.3))
  # Make the plot
  ggplot(data, aes(x = variable, fill = group)) +
    geom_density() +
    facet_grid(group ~ .) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf,
                                  y = Inf,
                                  label = text))
}
This works if there are three groups:
test.function("mpg", "cyl", mtcars)
It doesn't work if there are two groups:
test.function("mpg", "vs", mtcars)
Error in shapiro.test(.) : sample size must be between 3 and 5000
It also doesn't work if there are more than three groups:
test <- mtcars %>% mutate(new = rep(1:4, 8))
test.function("mpg", "new", test)
Error in data.frame(group = levels(data$group), text = c(shapiro.1, shapiro.2, :
arguments imply differing number of rows: 4, 3
What is the trick programmers usually use to accommodate any number of groups in such functions?
I was asked in the comments to explain the thinking here, so I thought I would expand on the original answer, which appears at the end of this answer.
The main question is how to do some operation on an unknown number of groups. There are lots of different ways to do that. In any of the ways, you need the function to be able to identify the number of groups and adapt to that number. For example, you could do something like the code below. There, I identify the unique groups in the data, initialize the required result and then loop over all of the groups. I didn't use this strategy because the for loop feels a bit clunky compared to the dplyr code.
un_group <- na.omit(unique(data[[group]]))
dat_text <- data.frame(group = un_group,
                       text = NA)
for(i in 1:length(un_group)){
  tmp <- data[which(data[[group]] == un_group[i]), ]
  dat_text$text[i] <- as.character(round(shapiro.test(tmp[[variable]])$p.value, 3))
}
The other thing to keep in mind is what's going to scale well. You mentioned that you've got lots of operations the code will ultimately do. In what's below, I just had summarise print a single number. However, you could write a little function that would produce a dataset and then summarise can return a number of results. For example, consider:
myfun <- function(x){
  s <- shapiro.test(x)
  data.frame(p = s$p.value, stat = s$statistic,
             mean = mean(x, na.rm = TRUE),
             sd = sd(x, na.rm = TRUE),
             skew = DescTools::Skew(x, na.rm = TRUE),
             kurtosis = DescTools::Kurt(x, na.rm = TRUE))
}

mtcars %>% group_by(cyl) %>% summarise(myfun(mpg))
# # A tibble: 3 x 7
# cyl p stat mean sd skew kurtosis
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 0.261 0.912 26.7 4.51 0.259 -1.65
# 2 6 0.325 0.899 19.7 1.45 -0.158 -1.91
# 3 8 0.323 0.932 15.1 2.56 -0.363 -0.566
In the function above, I had the function return a data frame with several different variables. A single call to summarise returns all of those results for the variable for each group. This would certainly have been possible using a for loop or something like sapply(), but I like how the dplyr code reads a bit better. And, depending on how many groups you have, the dplyr code scales a bit better than some of the base R stuff.
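As a point of comparison, here is roughly what the base R route would look like for just the per-group p-values; this is a sketch and not part of the original answer:
# split() breaks the variable into one vector per group; sapply() applies the
# test to each piece and returns a named vector of p-values
sapply(split(mtcars$mpg, mtcars$cyl),
       function(x) round(shapiro.test(x)$p.value, 3))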
I really like trying to reflect the inputs (i.e., input variable names) in the outputs, so I wanted to find a way to get around making variables called group and variable in the data. The aes_string() specification is one way of doing that, and building a formula from the variable names is another. I only recently encountered the reformulate() function, which is a more robust way of building formulae than the combination of paste() and as.formula() I was using before.
Those were the things I was thinking about when I was answering the question.
test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)
  # Compute individual group stats
  data[[group]] <- as.factor(data[[group]])
  dat_text <- data %>%
    group_by(.data[[group]]) %>%
    summarise(text = shapiro.test(.data[[variable]])$p.value) %>%
    mutate(text = as.character(round(text, 3)))
  gform <- reformulate(".", response = group)
  # Make the plot
  ggplot(data, aes_string(x = variable, fill = group)) +
    geom_density() +
    facet_grid(gform) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf,
                                  y = Inf,
                                  label = text))
}
test.function("mpg", "vs", mtcars)
test.function("mpg", "cyl", mtcars)
I am using the mtcars dataset as an example and I use this code.
library(ggplot2)
library(ggsci)
library(ggpubr) # ggviolin() comes from the ggpubr package

ggviolin(mtcars, x = "cyl", y = "disp", fill = "cyl", palette = "jco", facet.by = "am")
To each facet, I would like to add a fourth category on the x-axis (maybe called "6or8"), in which the 6- and 8-cylinder groups (but not the 4-cylinder group) are combined. I found a similar post, but it did not help me because of my facets and because I am combining only two of the categories rather than all of them.
Does anyone have a suggestion? Thank you.
You could try this:
library(dplyr) # for mutate(), filter(), and the pipe

newmtcars <- rbind(mtcars %>% mutate(cyl = as.character(cyl)),
                   mtcars %>% filter(cyl %in% c(6, 8)) %>% mutate(cyl = '6or8')) %>%
  arrange(cyl)

ggviolin(newmtcars, x = "cyl", y = "disp", fill = "cyl", palette = "jco", facet.by = "am")
You can manually change the levels for cyl to change the ordering in the plot (if, for example, you want "6or8" to be the first/last level).
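For example, to make "6or8" the last level on the x-axis, you could reset the factor levels before plotting (a small sketch, assuming the newmtcars object created above):
newmtcars$cyl <- factor(newmtcars$cyl, levels = c("4", "6", "8", "6or8"))
ggviolin(newmtcars, x = "cyl", y = "disp", fill = "cyl", palette = "jco", facet.by = "am")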
One really cool feature from the ggplot2 package that I never really exploited enough was adding lists of layers to a plot. The fun thing about this was that I could pass a list of layers as an argument to a function and have them added to the plot. I could then get the desired appearance of the plot without necessarily returning the plot from the function (whether or not this is a good idea is another matter, but it was possible).
library(ggplot2)
x <- ggplot(mtcars,
            aes(x = qsec,
                y = mpg))

layers <- list(geom_point(),
               geom_line(),
               xlab("Quarter Mile Time"),
               ylab("Fuel Efficiency"))

x + layers
Is there a way to do this with pipes? Something akin to:
#* Obviously isn't going to work
library(dplyr)
action <- list(group_by(am, gear),
               summarise(mean = mean(mpg),
                         sd = sd(mpg)))
mtcars %>% action
To construct a reusable sequence of magrittr steps (a functional sequence), start the pipeline with the dot placeholder:
action = . %>% group_by(am, gear) %>% summarise(mean = mean(mpg), sd = sd(mpg))
Then it can be used as imagined in the OP:
mtcars %>% action
Like a list, we can subset to see each step:
action[[1]]
# function (.)
# group_by(., am, gear)
To review all steps, use functions(action) or just type the name:
action
# Functional sequence with the following components:
#
# 1. group_by(., am, gear)
# 2. summarise(., mean = mean(mpg), sd = sd(mpg))
#
# Use 'functions' to extract the individual functions.
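Since action is itself a function of one argument, it is not tied to mtcars and can be reused on any compatible data. A small usage sketch, not from the original answer:
# apply the same grouped summary to a subset of the data
mtcars %>% filter(cyl != 8) %>% action
# or call it directly like a regular function
action(mtcars)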