I'm using do() to fit a model to grouped data, and then I want to plot the fit for each group. In plyr, I guess I would use d_ply(). In dplyr, I'm trying either do() or summarise() using a function that makes the plot as a side effect.
I'm getting different results depending on whether I use do() or summarise(), and I'm not sure why. Specifically it seems like summarise() isn't operating on each row correctly.
Here's my example:
require(nycflights13)
require(mgcv)
# fit a gam to the flights grouped by dest (from ?do)
by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
models = by_dest %>% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
# print the first 4 rows, the dest is ABQ, ACK, ALB, ATL
models %>% slice(1:4)
# make a function to plot the models, titled by dest
plot.w.title = function(title, gam.model){
plot.gam(gam.model, main=title)
return(1)
}
# This code makes plots with the wrong titles, for example ATL is listed twice:
models %>%
slice(1:4) %>%
rowwise %>%
summarise(useless.column = plot.w.title(dest, smooth)) # for plot side effect
# this code gives me the correct titles...why the difference?
models %>%
slice(1:4) %>%
rowwise %>%
do(useless.column = plot.w.title(.$dest, .$smooth))
The summarise() method will work if you modify the function by applying unique() to the title:
plot.w.title = function(title, gam.model){
plot.gam(gam.model, main=unique(title))
return(1)
}
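If the plots are only wanted for their side effect, another option is to skip the data-frame verbs and walk over the titles and models in parallel. Here is a minimal base-R sketch of that idea. It assumes `lm()` fits on `mtcars` as a stand-in for the `gam()` fits above, and the names `groups`, `fits`, and `plot_fit` are made up for illustration:

```r
# Stand-in for the grouped models: one lm() fit per cylinder count
groups <- split(mtcars, mtcars$cyl)
fits <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# Plot one fit, titled by its group; returns the title invisibly
plot_fit <- function(title, fit) {
  plot(fit$model$wt, fit$model$mpg, main = title, xlab = "wt", ylab = "mpg")
  abline(fit)  # fitted regression line
  invisible(title)
}

# Map() hands each name to plot_fit together with its own model
titles <- Map(plot_fit, names(fits), fits)
```

Because each call receives exactly one title and one model, there is no grouping state that could pair a plot with the wrong title.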
I want to use the following example to run t-tests on multiple variables. The code is taken from https://www.datanovia.com/en/blog/how-to-perform-multiple-t-test-in-r-for-different-variables/:
options(scipen = 99)
# Load required R packages
library(tidyverse)
library(rstatix)
library(ggpubr)
# Prepare the data and inspect a random sample of the data
mydata <- iris %>%
filter(Species != "setosa") %>%
as_tibble()
mydata %>% sample_n(6)
# Transform the data into long format
# Put all variables in the same column except `Species`, the grouping variable
mydata.long <- mydata %>%
pivot_longer(-Species, names_to = "variables", values_to = "value")
mydata.long %>% sample_n(6)
stat.test <- mydata.long %>%
group_by(variables) %>%
t_test(value ~ Species) %>%
adjust_pvalue(method = "BH") %>%
add_significance()
stat.test
This tutorial uses the t_test function of the rstatix package. It works great, but is there a way to disable the scientific notation of the p-values? I want to output p-values like 0.000445 instead of 4.45e-4.
Unfortunately, setting
options(scipen = 99)
did not change anything.
Thank you!
EDIT: The solution can be found in the comments - it is necessary to call stat.test this way:
as.data.frame(stat.test)
Thanks to rawr for the comment!
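As far as I understand it, `as.data.frame()` helps because tibbles are printed with their own number formatting rather than base R's, so `scipen` is not consulted. If you would rather keep the table as it is, another option is to store a fixed-notation copy of the p-values as text. A small sketch, assuming a made-up results table (`res`, `variable`, `p`, and `p_fixed` are hypothetical names, not rstatix output):

```r
# Hypothetical results table with made-up p-values
res <- data.frame(variable = c("a", "b"), p = c(0.000445, 1.2e-7))

# format() with scientific = FALSE returns character strings in fixed notation
res$p_fixed <- format(res$p, scientific = FALSE)

print(res)
```

The new column is character, so keep the numeric `p` column around for any further computation.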
I'm looking for a way to apply a function to either specified labels, or to all labels that are included in the plot. The goal is to have neat human readable labels that derive from the default labels, without having to specify each.
To demonstrate what I am looking for in terms of the input variable names and the output, I am including an example based on the starwars data set, that uses the versatile snakecase::to_sentence_case() function, but this could apply to any function, including ones that expand short variable names in pre-determined ways:
library(tidyverse)
library(snakecase)
starwars %>%
filter(mass < 1000) %>%
mutate(species = species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
ggplot(aes(height, mass, color=species, size=birth_year)) +
geom_point() +
labs(
x = to_sentence_case("height"),
y = to_sentence_case("mass"),
color = to_sentence_case("species"),
size = to_sentence_case("birth_year")
)
Which produces the following graph:
The graph is the desired output, but requires that each of the labels be specified by hand, increasing the possibility of error if the variables are later changed. Note that if I had not specified the labels, all the labels would have been applied automatically, but with the variable names instead of the prettier versions.
This issue seems to be somewhat related to what the labeller() function is intended for, but it seems that it only applies to faceting. Another related issue is raised in this question. However, both of these seem to apply only to values contained within the data, not to the variable names that are being used in the plot, which is what I am looking for.
The very helpful answer by #z-lin showed me a simple way to do this: modify the plot object before printing it.
The intended result can be achieved with the help of gg_apply_labs(), a short function that will apply an arbitrary string processing function to the $labels of a plot object. The resulting code should be a self-contained illustration of this approach:
# Packages
library(tidyverse)
library(snakecase)
# This applies fun to each label present in the plot object
#
# fun should accept and return character vectors. It can be a simple
# prettifying function, or it can perform a more complex lookup that replaces
# variable names with variable labels
gg_apply_labs <- function(p, fun) {
p$labels <- lapply(p$labels, fun)
p
}
# This gives the intended result
# Note: The plot is assigned to a named variable before piping to gg_apply_labs()
p <- starwars %>%
filter(mass < 1000) %>%
mutate(species = species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
ggplot(aes(height, mass, color=species, size=birth_year)) +
geom_point()
p %>% gg_apply_labs(to_sentence_case)
# This also gives the intended result, in a single pipeline
# Note: It is important to put in the extra parentheses!
(starwars %>%
filter(mass < 1000) %>%
mutate(species = species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
ggplot(aes(height, mass, color=species, size=birth_year)) +
geom_point()) %>%
gg_apply_labs(to_sentence_case)
# This DOES NOT give the intended result
# Note: The issue is operator precedence: %>% binds more tightly than +,
# so only geom_point() is piped into gg_apply_labs(), not the whole plot
starwars %>%
filter(mass < 1000) %>%
mutate(species = species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
ggplot(aes(height, mass, color=species, size=birth_year)) +
geom_point() %>%
gg_apply_labs(to_sentence_case)
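The precedence point can be checked without ggplot2 at all, since every `%...%` operator in R binds more tightly than `+`. A tiny sketch using a toy operator (`%then%` is made up for this demonstration and simply applies a function to its left-hand side):

```r
# A toy pipe-like operator, defined only for this demonstration
`%then%` <- function(x, f) f(x)

a <- 1 + 8 %then% sqrt    # parsed as 1 + (8 %then% sqrt)
b <- (1 + 8) %then% sqrt  # parentheses force the whole sum to be piped

print(c(a = a, b = b))
```

This mirrors the plot case: without the extra parentheses, only the last term of the `+` chain reaches the function on the right-hand side of the pipe.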
A simple solution is to pipe through rename_all (or rename_if if you want more control) before plotting:
library(tidyverse)
library(snakecase)
starwars %>%
filter(mass<1000) %>%
mutate(species=species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
rename_all(to_sentence_case) %>%
#rename_if(is.character, to_sentence_case) %>%
ggplot(aes(Height, Mass, color=Species, size=`Birth year`)) +
geom_point()
#> Warning: Removed 23 rows containing missing values (geom_point).
Created on 2019-11-25 by the reprex package (v0.3.0)
Note, though, that the variables given to aes in ggplot in this case must be modified to match the modified sentence case variable names.
You can modify a ggplot object's appearance at the point of printing / plotting it, without affecting the original plot object, using trace:
trace(what = ggplot2:::ggplot_build.ggplot,
tracer = quote(plot$labels <- lapply(plot$labels,
<whatever string function you desire>)))
This will change the appearance of all existing / new ggplot objects you wish to plot / save, until you turn off the trace via either untrace(...) or tracingState(on = FALSE).
Illustration
Create a normal plot with default labels in lower case:
library(tidyverse)
p <- starwars %>%
filter(mass < 1000) %>%
mutate(species=species %>% fct_infreq %>% fct_lump(5) %>% fct_explicit_na) %>%
ggplot(aes(height, mass, color=species, size=birth_year)) +
geom_point() +
theme_bw()
p # if we print the plot now, all labels will be lower-case
Apply a function to modify the appearance of all labels:
trace(what = ggplot2:::ggplot_build.ggplot,
tracer = quote(plot$labels <- lapply(plot$labels,
snakecase::to_sentence_case)))
p # all labels will be in sentence case
trace(what = ggplot2:::ggplot_build.ggplot,
tracer = quote(plot$labels <- lapply(plot$labels,
snakecase::to_screaming_snake_case)))
p # all labels will be in upper case
trace(what = ggplot2:::ggplot_build.ggplot,
tracer = quote(plot$labels <- lapply(plot$labels,
snakecase::to_random_case)))
p # all letters in all labels may be in upper / lower case randomly
# (exact order can change every time we print the plot again, unless we set the same
# random seed for reproducibility)
trace(what = ggplot2:::ggplot_build.ggplot,
tracer = quote(plot$labels <- lapply(plot$labels,
function(x) paste("!!!", x, "$$$"))))
p # all labels now have "!!!" in front & "$$$" behind (this is a demonstration for
# an arbitrary user-defined function, not a demonstration of good taste in labels)
Toggle between applying & not applying the function:
tracingState(on = FALSE)
p # back to sanity, temporarily
tracingState(on = TRUE)
p # plot labels are affected by the function again
untrace(ggplot2:::ggplot_build.ggplot)
p # back to sanity, permanently
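To see the mechanism in isolation: the tracer expression is evaluated inside the traced function's frame on entry, which is why it can overwrite plot$labels before the build step uses them. A minimal base-R sketch of the same trick on a toy function (`greet` is a made-up example, not part of ggplot2):

```r
greet <- function(name) paste("hello", name)

# The tracer runs in greet()'s own frame before the body,
# so it can change the argument the body will see
trace(greet, tracer = quote(name <- toupper(name)), print = FALSE)
traced <- greet("luke")

# Remove the trace to restore the original behaviour
untrace(greet)
plain <- greet("luke")

print(c(traced = traced, plain = plain))
```

Since the approach reaches into ggplot2 internals via `:::`, it may need adjusting if a future ggplot2 version renames or restructures the build method.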
I am trying to create a heat map in R using three factors. I would like to be able to fill the colour using the modal category of one of the factors but I have not been able to find out how to do this.
When I try ggplot with geom_tile, it does produce the heatmap, however, I am not sure how it chooses the value of the fill variable. It certainly isn't the mode because I've checked this.
For instance, using the inbuilt dataset ChickWeight, I would like the fill to be based on the modal (most frequent) category of a variable "weight_group" I created.
data(ChickWeight)
glimpse(ChickWeight)
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
table(ChickWeight$Diet, ChickWeight$Time, ChickWeight$weight_group)
ggplot(data = ChickWeight, aes(x=Time, y=Diet, fill=weight_group)) +
geom_tile()
Based on the three-way table, the bottom right block should be pink (corresponding to weight_group==1) rather than green as the modal category of weight_group when Diet==1 & Time==1 is weight_group==1 (11 counts).
Any help on this would be greatly appreciated.
Thank you!
You can define a function getMode that calculates the mode of a vector using plyr's count function to create a data frame of the counts for each class. Then sort the data frame and get the top value.
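# plyr is called via plyr::count() rather than attached: running library(plyr)
# after dplyr would mask dplyr's summarize() and break the grouped pipeline below
getMode <- function(vec){
  # count each distinct value, then sort by descending frequency
  df <- plyr::count(vec) %>%
    arrange(-freq)
  # the first row now holds the most frequent value
  return(df[1,"x"])
}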
From here group by time and diet so you can find the mode for each combination of these groups and then use this as the fill for ggplot.
ChickWeight %>%
group_by(Time, Diet) %>%
summarize(modeWeightGroup = getMode(weight_group)) %>%
ggplot(aes(x=Time, y=Diet, fill= modeWeightGroup)) +
geom_tile()
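If you would rather avoid the plyr dependency, the same per-group mode can be sketched in base R with table() and which.max(). The toy data and the names `get_mode`, `toy`, and `modes` below are made up for illustration:

```r
# Most frequent value of a vector; ties go to the first level in table() order
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Hypothetical toy data: two groups with a categorical value each
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c("lo", "hi", "hi", "lo", "lo", "hi")
)

# One mode per group
modes <- tapply(toy$value, toy$group, get_mode)
print(modes)
```

Note that this returns the mode as a character string, so convert it back to a factor if the level order matters for the fill scale.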
I also don't think the bottom right square should be weight_group 1. The three-way table is printed as one Diet by Time slice per weight_group, so that cell says that among the chicks in weight_group 1, the modal (Time, Diet) combination is (1, 1). It does not say that weight_group 1 is the modal group for that combination.
Using dplyr to count the most frequent category of weight_group for each combination of Time and Diet :
ChickWeight %>%
group_by(Time, Diet) %>%
count(weight_group) %>%
filter(n == max(n)) %>%
ggplot(
aes(x = Time,
y = Diet,
fill = weight_group)
) +
geom_tile()
By the way, since you already know dplyr::mutate, you should know you can do all the pre-processing you are doing here inside a single mutate.
That means instead of :
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
you can simply type :
ChickWeight <-
ChickWeight %>%
mutate(
Time = as.factor(ifelse(Time>=10, 1 ,0)),
Diet = as.factor(Diet),
weight_group = as.factor(ntile(weight, 3))
)
I have a dataset in which I have one numeric variable and many categorical variables. I would like to make a grid of density plots, each showing the distribution of the numeric variable for different categorical variables, with the fill corresponding to subgroups of each categorical variable. For example:
library(tidyverse)
library(nycflights13)
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
plot_1 <- dat %>%
ggplot(aes(x = distance, fill = carrier)) +
geom_density()
plot_1
plot_2 <- dat %>%
ggplot(aes(x = distance, fill = origin)) +
geom_density()
plot_2
I would like to find a way to quickly make these two plots. Right now, the only way I know how to do this is to create each plot individually, and then use grid.arrange to put them together. However, my real dataset has something like 15 categorical variables, so this would be very time-intensive!
Is there a quicker and easier way to do this? I believe that the hardest part about this is that each plot has its own legend, so I'm not sure how to get around that stumbling block.
This solution returns all the plots in a list. We write a single function that accepts the variable to use as the fill, and then call lapply with a vector of all the variables we want to plot.
fill_variables <- vars(carrier, origin)
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!fill_variable)) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
If you have no idea what those !! mean, I recommend watching this 5 minute video that introduces the key concepts of tidy evaluation. This is what you want to use when you create these sorts of wrapper functions to do things programmatically. I hope this helps!
Edit: If you want to pass a character vector of column names instead of quosures, you can change !!fill_variable to !!sym(fill_variable) as follows:
fill_variables <- c('carrier', 'origin')
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!sym(fill_variable))) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
Alternative solution
As #djc wrote in the comments: "I'm having trouble passing the column names into 'fill_variables'. Right now I am extracting column names using the following code..."
You can separate the categorical and continuous variables by type: sapply(flights, is.character) identifies the categorical columns and !sapply(flights, is.character) the continuous ones. You can then pass them into the wrapper function given by mgiormenti.
Full code is given below;
library(tidyverse)
library(nycflights13)
# Names of the character (categorical) and non-character (continuous) columns
cat_vars <- names(flights)[sapply(flights, is.character)]
cont_vars <- names(flights)[!sapply(flights, is.character)]
# Density of distance, filled by one categorical column (given by name)
func_plot_cat <- function(cat_var) {
  flights %>%
    ggplot(aes(x = distance, fill = !!sym(cat_var))) +
    geom_density()
}
# Scatter of one continuous column against distance
# (geom_point() needs a y aesthetic, so the column is mapped to y)
func_plot_cont <- function(cont_var) {
  flights %>%
    ggplot(aes(x = distance, y = !!sym(cont_var))) +
    geom_point()
}
plotlist_cat_vars <- lapply(cat_vars, func_plot_cat)
plotlist_cont_vars <- lapply(cont_vars, func_plot_cont)
print(plotlist_cat_vars)
print(plotlist_cont_vars)
One really cool feature from the ggplot2 package that I never really exploited enough was adding lists of layers to a plot. The fun thing about this was that I could pass a list of layers as an argument to a function and have them added to the plot. I could then get the desired appearance of the plot without necessarily returning the plot from the function (whether or not this is a good idea is another matter, but it was possible).
library(ggplot2)
x <- ggplot(mtcars,
aes(x = qsec,
y = mpg))
layers <- list(geom_point(),
geom_line(),
xlab("Quarter Mile Time"),
ylab("Fuel Efficiency"))
x + layers
Is there a way to do this with pipes? Something akin to:
#* Obviously isn't going to work
library(dplyr)
action <- list(group_by(am, gear),
summarise(mean = mean(mpg),
sd = sd(mpg)))
mtcars %>% action
To construct a sequence of magrittr steps, start with .
action = . %>% group_by(am, gear) %>% summarise(mean = mean(mpg), sd = sd(mpg))
Then it can be used as imagined in the OP:
mtcars %>% action
Like a list, we can subset to see each step:
action[[1]]
# function (.)
# group_by(., am, gear)
To review all steps, use functions(action) or just type the name:
action
# Functional sequence with the following components:
#
# 1. group_by(., am, gear)
# 2. summarise(., mean = mean(mpg), sd = sd(mpg))
#
# Use 'functions' to extract the individual functions.
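For comparison, the literal "list of steps" idea from the question can also be sketched in base R without magrittr, by folding a list of one-argument functions over the data with Reduce(). The names `steps` and `apply_steps` are made up for illustration, and aggregate() stands in for the group_by() + summarise() pair:

```r
# Each step takes a data frame and returns a data frame
steps <- list(
  function(d) d[d$mpg > 15, ],
  function(d) aggregate(mpg ~ am + gear, data = d, FUN = mean)
)

# Feed the data through the steps from left to right
apply_steps <- function(data, steps) {
  Reduce(function(acc, f) f(acc), steps, data)
}

result <- apply_steps(mtcars, steps)
print(result)
```

The functional sequence built with `. %>%` is the more convenient form when the steps are already written as pipe stages. The Reduce() version is mainly useful when the list of steps is assembled programmatically.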