Custom function: allow unknown number of groups for operations - r

Within a custom function, how can I avoid repeating the same code for each group while allowing an unknown number of groups?
Here's a simpler example but assume the function has tons of operations, like calculating different statistics for each group and sticking them on each ggplot facet. Sorry, I find it difficult to make a simpler function to demonstrate this specific challenge.
test.function <- function(variable, group, data) {
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(ggrepel)){install.packages("ggrepel")}
library(dplyr)
library(ggplot2)
require(ggrepel)
data$variable <- data[,variable]
data$group <- factor(data[,group])
# Compute individual group stats
data %>%
filter(data$group==levels(data$group)[1]) %>%
select(variable) %>%
unlist %>%
shapiro.test() -> shap
shapiro.1 <- round(shap$p.value,3)
data %>%
filter(data$group==levels(data$group)[2]) %>%
select(variable) %>%
unlist %>%
shapiro.test() -> shap
shapiro.2 <- round(shap$p.value,3)
data %>%
filter(data$group==levels(data$group)[3]) %>%
select(variable) %>%
unlist %>%
shapiro.test() -> shap
shapiro.3 <- round(shap$p.value,3)
# Make the stats dataframe for ggplot
dat_text <- data.frame(
group = levels(data$group),
text = c(shapiro.1, shapiro.2, shapiro.3))
# Make the plot
ggplot(data, aes(x=variable, fill=group)) +
geom_density() +
facet_grid(group ~ .) +
geom_text_repel(data = dat_text,
mapping = aes(x = Inf,
y = Inf,
label = text))
}
Works if there's three groups
test.function("mpg", "cyl", mtcars)
Doesn't work if there's two groups
test.function("mpg", "vs", mtcars)
Error in shapiro.test(.) : sample size must be between 3 and 5000
Doesn't work if there's more than three groups
test <- mtcars %>% mutate(new = rep(1:4, 8))
test.function("mpg", "new", test)
Error in data.frame(group = levels(data$group), text = c(shapiro.1, shapiro.2, :
arguments imply differing number of rows: 4, 3
What is the trick programmers usually use to accommodate any number of groups in such functions?

I was asked in the comments to explain the thinking here, so I have expanded on the original answer, which appears at the end of this post.
The main question is how to do some operation on an unknown number of groups. There are lots of different ways to do that. In any of the ways, you need the function to be able to identify the number of groups and adapt to that number. For example, you could do something like the code below. There, I identify the unique groups in the data, initialize the required result and then loop over all of the groups. I didn't use this strategy because the for loop feels a bit clunky compared to the dplyr code.
un_group <- na.omit(unique(data[[group]]))
dat_text <- data.frame(group = un_group,
text = NA)
# seq_along() is safe even when there are zero groups
for(i in seq_along(un_group)){
tmp <- data[which(data[[group]] == un_group[i]), ]
dat_text$text[i] <- as.character(round(shapiro.test(tmp[[variable]])$p.value, 3))
}
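The same per-group idea can also be written in base R without an explicit loop. This is only a sketch: the helper name group_shapiro is hypothetical (it is not part of the original answer), and it assumes every group has between 3 and 5000 observations, as shapiro.test() requires.

```r
# Hypothetical helper: Shapiro-Wilk p-value per group, for any number of groups
group_shapiro <- function(variable, group, data) {
  # split() returns one vector per observed group level
  pieces <- split(data[[variable]], data[[group]], drop = TRUE)
  # vapply() runs the test once per group and collects the p-values
  p <- vapply(pieces, function(x) shapiro.test(x)$p.value, numeric(1))
  data.frame(group = names(p),
             text = as.character(round(p, 3)),
             row.names = NULL)
}

res2 <- group_shapiro("mpg", "vs", mtcars)   # two groups
res3 <- group_shapiro("mpg", "cyl", mtcars)  # three groups
```

Because the number of rows in the result follows the number of groups found in the data, the function needs no hard-coded shapiro.1, shapiro.2, and so on.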
The other thing to keep in mind is what's going to scale well. You mentioned that you've got lots of operations the code will ultimately do. In what's below, I just had summarise print a single number. However, you could write a little function that would produce a dataset and then summarise can return a number of results. For example, consider:
myfun <- function(x){
s = shapiro.test(x)
data.frame(p = s$p.value, stat=s$statistic,
mean = mean(x, na.rm=TRUE),
sd = sd(x, na.rm=TRUE),
skew = DescTools::Skew(x, na.rm=TRUE),
kurtosis = DescTools::Kurt(x, na.rm=TRUE))
}
mtcars %>% group_by(cyl) %>% summarise(myfun(mpg))
# # A tibble: 3 x 7
# cyl p stat mean sd skew kurtosis
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 0.261 0.912 26.7 4.51 0.259 -1.65
# 2 6 0.325 0.899 19.7 1.45 -0.158 -1.91
# 3 8 0.323 0.932 15.1 2.56 -0.363 -0.566
In the function above, I had the function return a data frame with several different variables. A single call to summarise returns all of those results for the variable for each group. This would certainly have been possible using a for loop or something like sapply(), but I like how the dplyr code reads a bit better. And, depending on how many groups you have, the dplyr code scales a bit better than some of the base R stuff.
I really like trying to reflect the inputs (i.e., input variable names) in the outputs, so I wanted to find a way to avoid creating variables called group and variable in the data. The aes_string() specification is one way of doing that, and building a formula from the variable names is another. I only recently encountered the reformulate() function, which is a more robust way of building formulae than the combination of paste() and as.formula() I was using before.
Those were the things I was thinking about when I was answering the question.
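As a small illustration of the formula-building point, the two approaches can be compared side by side. This is a sketch using only base R; the object names f_old and f_new are made up for the example.

```r
group <- "cyl"

# Older approach: paste the pieces together and parse the string
f_old <- as.formula(paste(group, "~ ."))

# reformulate() builds the same formula from the term labels and response
f_new <- reformulate(".", response = group)

deparse(f_old)
deparse(f_new)
```

Both calls produce a formula that can be handed to facet_grid(), so the facet labels carry the original column name rather than a generic "group".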
test.function <- function(variable, group, data) {
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(ggrepel)){install.packages("ggrepel")}
library(dplyr)
library(ggplot2)
require(ggrepel)
# Compute individual group stats
data[[group]] <- as.factor(data[[group]])
dat_text <- data %>% group_by(.data[[group]]) %>%
summarise(text=shapiro.test(.data[[variable]])$p.value) %>%
mutate(text=as.character(round(text, 3)))
gform <- reformulate(".", response=group)
# Make the plot
ggplot(data, aes_string(x=variable, fill=group)) +
geom_density() +
facet_grid(gform) +
geom_text_repel(data = dat_text,
mapping = aes(x = Inf,
y = Inf,
label = text))
}
test.function("mpg", "vs", mtcars)
test.function("mpg", "cyl", mtcars)

Related

ggplot2: Can you access the .data argument in subsequent layers?

I have multiple graphs I'm generating with a data set. I perform many operations on the data (filtering rows, aggregating rows, calculations over columns, etc.) before passing the result on to ggplot(). I want to access the data I passed to ggplot() in subsequent ggplot layers and facets, so I can have more control over the resulting graph and include some characteristics of the data in the plot itself, such as the number of observations.
Here is a reproducible example:
library(tidyverse)
cars <- mtcars
# Normal scatter plot
cars %>%
filter(
# Many complicated operations
) %>%
group_by(
# More complicated operations
across()
) %>%
summarise(
# Even more complicated operations
n = n()
) %>%
ggplot(aes(x = mpg, y = qsec)) +
geom_point() +
# Join the dots but only if mpg < 20
geom_line(data = .data %>% filter(mpg < 20)) +
# Include the total number of observations in the graph
labs(caption = paste("N. obs =", NROW(.data)))
One could of course create a separate data set before passing it on to ggplot and then reference that data set throughout (as in the example below). However, this is much more cumbersome, as you need to save (and later remove) a data set for each graph and run two separate commands for just one graph.
I want to know if there is something that can be done that's more akin to the first example using .data (which obviously doesn't actually work).
library(tidyverse)
cars <- mtcars
tmp <- cars %>%
filter(
# Many complicated operations
) %>%
group_by(
# More complicated operations
across()
) %>%
summarise(
# Even more complicated operations
n = n()
)
tmp %>%
ggplot(aes(x = mpg, y = qsec)) +
geom_point() +
# Join the dots but only if mpg < 20
geom_line(data = tmp %>% filter(mpg < 20)) +
# Include the total number of observations in the graph
labs(caption = paste("N. obs =", NROW(tmp)))
Thanks for your help!
The help page for each geom_ function helpfully describes a standard way to do this, under the data argument:
A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).
For labs, on the other hand, you can use the . placeholder from piping, but you have to a) pass . as the data argument to ggplot() in the first place and b) wrap the whole thing in curly braces so that the later . is recognised.
So for example:
library(tidyverse)
cars <- mtcars
# Normal scatter plot
cars %>%
filter() %>%
group_by(across()) %>%
summarise(n = n()) %>%
{
ggplot(., aes(x = mpg, y = qsec)) +
geom_point() +
geom_line(data = ~ filter(.x, mpg < 20)) +
labs(caption = paste("N. obs =", NROW(.)))
}
Or if you don't like the purrr formula syntax, then the flashy new R anonymous functions work too:
geom_line(data = \(x) filter(x, mpg < 20)) +
Unfortunately the labs function doesn't seem to have an explicit way of testing whether data is shuffling invisibly through the ggplot stack as by-and-large it usually can get on with its job without touching the main data. These are some ways around this.

How to plot a large number of density plots with different categorical variables

I have a dataset in which I have one numeric variable and many categorical variables. I would like to make a grid of density plots, each showing the distribution of the numeric variable for different categorical variables, with the fill corresponding to subgroups of each categorical variable. For example:
library(tidyverse)
library(nycflights13)
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
plot_1 <- dat %>%
ggplot(aes(x = distance, fill = carrier)) +
geom_density()
plot_1
plot_2 <- dat %>%
ggplot(aes(x = distance, fill = origin)) +
geom_density()
plot_2
I would like to find a way to quickly make these two plots. Right now, the only way I know how to do this is to create each plot individually, and then use grid.arrange to put them together. However, my real dataset has something like 15 categorical variables, so this would be very time intensive!
Is there a quicker and easier way to do this? I believe that the hardest part about this is that each plot has its own legend, so I'm not sure how to get around that stumbling block.
This solution gives all the plots in a list. Here we make a single function that accepts the variable you want to use as the fill, and then use lapply with a vector of all the variables you want to plot.
fill_variables <- vars(carrier, origin)
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!fill_variable)) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
If you have no idea what those !! mean, I recommend watching this 5 minute video that introduces the key concepts of tidy evaluation. This is what you want to use when you create these sorts of wrapper functions to do things programmatically. I hope this helps!
Edit: If you want to feed an array of strings instead of a quosure, you can change !!fill_variable for !!sym(fill_variable) as follows:
fill_variables <- c('carrier', 'origin')
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!sym(fill_variable))) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
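To get from the list of plots to the grid asked about in the question, the list can be passed to a layout function. The following is a sketch under a few assumptions: it uses mtcars instead of the flights data so that it is self-contained, it uses the .data pronoun as an alternative to !!sym(), and it assumes the gridExtra package is available for the final arrangement (the call is skipped otherwise).

```r
library(ggplot2)

# Stand-in data: two categorical columns and one numeric column
dat <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
fill_variables <- c("cyl", "gear")

func_plot <- function(fill_variable) {
  # .data[[...]] looks the column up by its name, given as a string
  ggplot(dat, aes(x = mpg, fill = .data[[fill_variable]])) +
    geom_density(alpha = 0.5) +
    labs(fill = fill_variable)
}

plotlist <- lapply(fill_variables, func_plot)
names(plotlist) <- fill_variables

# Each plot keeps its own legend; arrange them in a two-column grid
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(grobs = plotlist, ncol = 2)
}
```

Since every panel is a separate ggplot object, each one carries its own legend, which sidesteps the legend problem mentioned in the question.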
Alternative solution
As @djc wrote in the comments: "I'm having trouble passing the column names into 'fill_variables'. Right now I am extracting column names using the following code..."
You can separate the categorical and numerical variables with cat_vars <- flights[, sapply(flights, is.character)] for the categorical variables and cont_vars <- flights[, !sapply(flights, is.character)] for the continuous variables, and then pass these into the wrapper function given by mgiormenti.
Full code is given below;
library(tidyverse)
library(nycflights13)
cat_vars <- flights[, sapply(flights, is.character)]
cont_vars<- flights[, !sapply(flights, is.character)]
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
func_plot_cat <- function(cat_vars) {
dat %>%
ggplot(aes(x = distance, fill = !!cat_vars)) +
geom_density()
}
func_plot_cont <- function(cont_vars) {
dat %>%
ggplot(aes(x = distance, fill = !!cont_vars)) +
geom_point()
}
plotlist_cat_vars <- lapply(cat_vars, func_plot_cat)
plotlist_cont_vars<- lapply(cont_vars, func_plot_cont)
print(plotlist_cat_vars)
print(plotlist_cont_vars)

Show outliers in an efficient manner using ggplot

The actual data (and aim) I have is different, but for reproducibility I used the Titanic dataset. My aim is to create a plot of the age outliers (more than 1 SD from the mean) per class and sex.
Therefore the first thing I did is calculating the sd values and ranges:
library(dplyr)
library(ggplot2)
#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)
#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd
#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)
#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)
total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0
Next, using a loop I determine whether the age is inside or outside the range
for (row in 1:nrow(total_data)){
if((total_data$Age[row]) > (total_data$outlier_max[row])){
total_data$outlier[row] <- 1
} else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
total_data$outlier[row] <- 1
} else {
total_data$outlier[row] <- 0
}
}
Do some data cleaning ...
total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)
Now this code gives me the plot I am looking for.
ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
facet_grid(. ~Sex.x)
However, this does not really seem like the easiest way to crack this problem. Any thoughts on how I can apply best practices to make this more efficient?
One way to reduce your code and make it less repetitive is to put it all into one pipeline. Instead of creating a summary with the values and re-joining it with the data, you could do all of this within one mutate step:
titanic %>%
mutate(Pclass = as.factor(Pclass)) %>%
group_by(Pclass, Sex) %>%
mutate(Age.mean = mean(Age),
Age.sd = sd(Age),
outlier.max = Age.mean + Age.sd,
outlier.min = Age.mean - Age.sd,
outlier = as.factor(ifelse(Age > outlier.max, 1,
ifelse(Age < outlier.min, 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, Pclass, colour = outlier)) +
facet_grid(.~Sex)
Pclass is converted to a factor in advance, as it is a grouping factor. The remaining steps are then done within the piped data frame instead of creating two new ones. No changes are made to the original dataset, however. If you want to keep the new columns, assign the result of the mutate steps to titanic or another data frame and run the ggplot part as a separate step; otherwise you would assign the plot, not the data, to that name.
To identify the outliers, one option is the nested ifelse shown above. Alternatively, dplyr offers the handy between function; for this, you would need to add rowwise after creating the min and max thresholds. Note that between is TRUE for ages inside the range, so it has to be negated to flag outliers:
...
rowwise() %>%
mutate(outlier = as.factor(as.numeric(!between(Age, outlier.min, outlier.max)))) %>% ...
Plus:
Additionally, you could reduce your code even further, depending on which variables you want to keep:
titanic %>%
group_by(Pclass, Sex) %>%
mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1,
ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
facet_grid(.~Sex)
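The nested ifelse can also be collapsed into a single vectorised comparison, since "more than one SD from the group mean" is the same as abs(Age - mean(Age)) > sd(Age). The sketch below uses a small made-up data frame (toy is hypothetical, not the Titanic data) so that it runs on its own. If Age contains missing values, mean() and sd() would need na.rm = TRUE.

```r
library(dplyr)

# Hypothetical toy data standing in for the Titanic set
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 2, 2, 2, 2),
  Sex    = "male",
  Age    = c(20, 30, 40, 80, 10, 30, 32, 34)
)

flagged <- toy %>%
  group_by(Pclass, Sex) %>%
  # 1 if the age is more than one SD away from its group mean, else 0
  mutate(outlier = factor(as.integer(abs(Age - mean(Age)) > sd(Age)))) %>%
  ungroup()

flagged
```

In this toy data only the 80-year-old in class 1 and the 10-year-old in class 2 are flagged.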

Creating a scatter plot using two data sets in R

Beginner here. I'm hoping to create a scatterplot using two datasets that I created using group by:
menthlth_perc_bystate <- brfss2013 %>%
group_by(state) %>%
summarise(percent_instability = sum(menthlth > 15, na.rm = TRUE) / n()) %>%
arrange(desc(percent_instability))
exercise_perc_bystate <- brfss2013 %>%
group_by(state) %>%
summarise(perc_exercise = sum(exeroft1 > 30, na.rm = TRUE) / n()) %>%
arrange(desc(perc_exercise))
I want to merge these into one dataset, total_data. Both have 54 obs.
total_data <- merge(menthlth_perc_bystate,exercise_perc_bystate,by="state")
Presumably the scatter plot would have the state's percent instability (from menthlth_perc_bystate) on one axis and the state's percent exercise (from exercise_perc_bystate) on the other. I tried this using ggplot but got an error:
ggplot(total_data, aes(x = total_data$menthlth_perc_bystate, y = total_data$exercise_perc_bystate)) + geom_point()
The error: Aesthetics must be either length 1 or the same as the data (54): x, y
In the aes() function of ggplot you put the bare column names from the data frame you provided for the data argument. So in your example it would be:
ggplot(total_data ,
aes(x = percent_instability,
y = perc_exercise)) +
geom_point()
Although I'm not sure what total_ex is in your example.
Also, using total_ex$menthlth_perc_bystate implies you are looking for a column named menthlth_perc_bystate in the data frame total_ex. That column does not exist; menthlth_perc_bystate is the name of a different data frame. Once you have merged the two data frames, the columns in the resulting data frame will be state, percent_instability, and perc_exercise.
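A quick way to confirm which column names survive the merge is to inspect names() on the result. This sketch uses two tiny made-up data frames (a and b, with invented values) in place of the summarised BRFSS tables:

```r
# Hypothetical stand-ins for the two per-state summaries
a <- data.frame(state = c("A", "B"), percent_instability = c(0.1, 0.2))
b <- data.frame(state = c("A", "B"), perc_exercise = c(0.3, 0.4))

total <- merge(a, b, by = "state")

# The merged columns keep the names given in summarise(),
# not the names of the data frames they came from
names(total)
```

Those column names are what belongs inside aes().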

How to make plot scales the same or turn them into log scales in ggplot

I am using this script to plot chemical elements using ggplot2 in R:
# Load the same data set under a different name, because it is just for plotting elements as a well log:
Core31B1 <- read.csv('OilSandC31B1BatchResultsCr.csv', header = TRUE)
#
# Calculating the ratios of Ca.Ti, Ca.K, Ca.Fe:
C31B1$Ca.Ti.ratio <- (C31B1$Ca/C31B1$Ti)
C31B1$Ca.K.ratio <- (C31B1$Ca/C31B1$K)
C31B1$Ca.Fe.ratio <- (C31B1$Ca/C31B1$Fe)
C31B1$Fe.Ti.ratio <- (C31B1$Fe/C31B1$Ti)
#C31B1$Si.Al.ratio <- (C31B1$Si/C31B1$Al)
#
# Create a subset of ratios and depth
core31B1_ratio <- C31B1[-2:-18]
#
# Removing the totCount column:
Core31B1 <- Core31B1[-9]
#
# Melting the data set based on the depth values, to have only three columns: depth, element and count
C31B1_melted <- melt(Core31B1, id.vars="depth")
#ratio melted
C31B1_ra_melted <- melt(core31B1_ratio, id.vars="depth")
#
# Eliminating the NA data from the data set
C31B1_melted<-na.exclude(C31B1_melted)
# ratios
C31B1_ra_melted <-na.exclude(C31B1_ra_melted)
#
# Rename the columns:
colnames(C31B1_melted) <- c("depth","element","counts")
# ratios
colnames(C31B1_ra_melted) <- c("depth","ratio","percentage")
#
# Plotting the data in well log format using ggplot2:
Core31B1_Sp <- ggplot(C31B1_melted, aes(x=counts, y=depth)) +
theme_bw() +
geom_path(aes(linetype = element))+ geom_path(size = 0.6) +
labs(title='Core 31 Box 1 Bioturbated sediments') +
scale_y_reverse() +
facet_grid(. ~ element, scales='free_x') #rasterImage(Core31Image, 0, 1515.03, 150, 0, interpolate = FALSE)
#
# View the plot:
Core31B1_Sp
I got the following image (as you can see the plot has seven element plots, and each one has its scale. Please ignore the shadings and the image at the far left):
My question is: is there a way to make these scales the same, for example by using log scales? If yes, what should I change in my code to change the scales?
It is not clear what you mean by "the same", because using a common scale will not give you the same result as log-transforming the values. Here is how to get the log transformation, which, when combined with not using free_x, will give you the plot I think you are asking for.
First, since you didn't provide any reproducible data (see here for more on how to ask good questions), here is some that gives at least some of the features that I think your data has. I am using tidyverse (specifically dplyr and tidyr) to do the construction:
forRatios <-
names(iris)[1:3] %>%
combn(2, paste, collapse = " / ")
toPlot <-
iris %>%
mutate_(.dots = forRatios) %>%
select(contains("/")) %>%
mutate(yLocation = 1:n()) %>%
gather(Comparison, Ratio, -yLocation) %>%
mutate(logRatio = log2(Ratio))
Note that the last line takes the log base 2 of the ratio. This allows ratios in each direction (above and below 1) to plot meaningfully. I think that step is what you need. You can accomplish something similar with myDF$logRatio <- log2(myDF$ratio) if you don't want to use dplyr.
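To see why the log transformation treats both directions evenly, consider a few illustrative ratios (the values are chosen for the example, not taken from the data):

```r
# Ratios below and above 1 that are reciprocals of each other
ratio <- c(0.25, 0.5, 1, 2, 4)

# On the log2 scale they sit symmetrically around 0
log_ratio <- log2(ratio)
log_ratio
```

A ratio of 0.5 and a ratio of 2 end up the same distance from zero, whereas on the raw scale everything below 1 is squeezed into the interval between 0 and 1.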
Then, you can just plot that:
ggplot(
toPlot
, aes(x = logRatio
, y = yLocation) ) +
geom_path() +
facet_wrap(~Comparison)
Gives:
