Ggplot subset data functions and dplyr - r

When doing data analysis, I often use dplyr to filter the dataframe further inside specific geoms. Passing a function as the data argument lets me change the default dataframe of a ggplot later and have every layer still work.
template <- ggplot(db, aes(x=time, y=value)) +
geom_line(data=function(db){db %>% filter(event=="Bla")}) +
geom_ribbon(aes(ymin=low, ymax=up))
ggsave("global.png", template)
for(i in unique(db$simulation))
ggsave(paste0(i, ".png"), template %+% subset(db, simulation==i))
Is there a nicer/shorter way to specify the filter command, e.g. using some magical .?
EDIT
To clarify some of the comments: By using geom_line(data = db %>% filter(event=="Bla")), the layer would not be updated when I change the default dataframe later using %+%. I am really aiming to use the data argument of geom_* as a function.

Upon reading the documentation of %>% better, I have found the solution:
Using the dot-place holder as lhs
When the dot is used as lhs, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input. See the examples.
Therefore, the nicest way to formulate the above example, incorporating the suggestions from above as well:
db <- diamonds
template <- ggplot(db, aes(x=carat, y=price, color=cut)) +
geom_point() +
geom_smooth(data=. %>% filter(color=="J")) +
labs(caption="Smooths only for J color")
ggsave("global.png", template)
db %>% group_by(cut) %>% do(
ggsave( paste0(.$cut[1], ".png"), plot=template %+% .)
)
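As a minimal sketch of why this works (the name keep_j is only illustrative, and the diamonds data set from ggplot2 is assumed): a pipeline that starts with the dot is an ordinary one-argument function, so a layer can apply it to whatever default data the plot has when it is built.

```r
library(dplyr)    # provides %>% and filter()
library(ggplot2)  # provides ggplot() and the diamonds data set

# `. %>% filter(...)` builds a magrittr functional sequence, i.e. a function of one argument.
# keep_j is a hypothetical name used only for this illustration.
keep_j <- . %>% filter(color == "J")

# It can be called directly on any data frame that has a `color` column.
j_rows <- keep_j(diamonds)

# Passed as `data`, the function is applied to the plot's default data when the plot is built.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(data = keep_j)

# layer_data() returns the data actually drawn by the first layer.
drawn <- layer_data(p, 1)
```

Because the layer stores the function rather than a filtered copy of the data, swapping the plot's default data later reruns the filter on the new data.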

Related

How to for loop in R over a variable

My question is about using a for loop to repeat data analysis based on a categorical variable.
Using the built in Iris data set how would I run a for loop on the code below so it first produces this chart for just setosa and then versicolor and then virginica without me having to manually change/set the species?
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point()
I'm just starting out and have no idea what I'm doing
Inside a for loop, ggplot objects are not printed automatically, so you need to call print() explicitly:
library(tidyverse)
data(iris)
species <- iris |> distinct(Species) |> unlist()
for(i in species) {
p <- iris |>
filter(Species == i) |>
ggplot() +
geom_point(aes(x=Sepal.Length, y=Sepal.Width)) +
ggtitle(i)
print(p)
}
You can use a for loop as u/DanY posted; however, it's harder to store and retrieve plots in a universal way with that structure. Running the loop code makes it difficult to retrieve any one particular plot - you would only see the last plot in the output window and have to go "back" to see the others. I would suggest using a list structure instead to allow you to retrieve any one of the individual plots in subsequent functions.
For this, you can use lapply() rather than for(...) { ... }.
Here's an example which uses dplyr and tidyr:
library(ggplot2)
library(dplyr)
library(tidyr)
unique_species <- unique(iris$Species)
myPlots <- lapply(unique_species, function(x) {
ggplot(
data = iris %>% dplyr::filter(Species == x),
mapping = aes(x=Sepal.Length, y=Sepal.Width)
) +
geom_point() +
labs(title=paste("Plot of ", x))
})
You then have the plots stored within myPlots. You can access each plot via myPlots[[1]], myPlots[[2]] or myPlots[[3]], or you can plot them all together via patchwork or a similar package. Here's one way using cowplot:
cowplot::plot_grid(plotlist = myPlots, nrow=1)

purrr::pmap() output incompatible with what ggplot::aes() expects

Problem: purrr::pmap() output incompatible with ggplot::aes()
The following reprex boils down to a single question: is there any way to use quoted variable names inside ggplot2::aes() instead of bare names? Example: we typically write ggplot(mpg, aes(displ, cyl)); how can we make aes() work normally with ggplot(mpg, aes("displ", "cyl"))?
If you understood my question, the remainder of this reprex really adds no information. However, I added it to draw the full picture of the problem.
More details: I want to use purrr functions to create a batch of routine exploratory data analysis plots effortlessly. The problem is that purrr::pmap() passes the variable names as quoted strings, which ggplot2::aes() doesn't interpret as column names. As far as I can tell, cat() and as.name() should take a quoted variable name and return it in the unquoted form that aes() understands. However, neither of them worked. The following reprex reproduces the problem. I commented the code to spare you the pain of figuring out what it does.
library(tidyverse)
# Divide the classes of variables into numeric and non-numeric. Goal: place a combination of numeric variables on the axes while encoding a non-numeric variable.
mpg_numeric <- map_lgl(.x = seq_along(mpg), .f = ~ mpg[[.x]] %>% class() %in% c("numeric","integer"))
mpg_factor <- map_lgl(.x = seq_along(mpg), .f = ~ mpg[[.x]] %>% class() %in% c("factor","character"))
# create all possible combinations of the variables
eda_routine_combinations <- expand_grid(num_1 = mpg[mpg_numeric] %>% names(),
num_2 = mpg[mpg_numeric] %>% names(),
fct = mpg[mpg_factor] %>% names()) %>%
filter(num_1 != num_2) %>% slice_head(n = 2) # for simplicity, keep only the first 2 combinations
# use purrr::pmap() to create all the plots we want in a single call
pmap(.l = list(eda_routine_combinations$num_1,
eda_routine_combinations$num_2,
eda_routine_combinations$fct) ,
.f = ~ mpg %>%
ggplot(aes(..1 , ..2, col = ..3)) +
geom_point() )
Next we pinpoint the problem using a typical ggplot2 call.
This is what we want purrr::pmap() to create in its iterations:
mpg %>%
ggplot(aes(displ , cyl, fill = drv)) +
geom_boxplot()
However, this is what purrr::pmap() renders, with quoted variable names:
mpg %>%
ggplot(aes("displ" , "cyl", fill = "drv")) +
geom_boxplot()
Failing attempts
Using cat() to transform the quoted variable names from pmap() into unquoted form for aes() to understand fails.
mpg %>%
ggplot(aes(cat("displ") , cat("cyl"), fill = cat("drv"))) +
geom_boxplot()
Using as.name() to transform the quoted variable names from pmap() into unquoted form for aes() to understand fails.
mpg %>%
ggplot(aes(as.name("displ") , as.name("cyl"), fill = as.name("drv"))) +
geom_boxplot()
Bottom line
Is there a way to make ggplot(aes("quoted_var_name")) work properly?
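One possible approach, sketched under the assumption that ggplot2 3.0.0 or later is installed: the .data pronoun lets aes() look up a column by a string, so .data[["displ"]] refers to the displ column. The small combos table below is made up for illustration and stands in for eda_routine_combinations.

```r
library(ggplot2)
library(purrr)

# Hypothetical stand-in for eda_routine_combinations: one row per plot.
combos <- data.frame(
  num_1 = c("displ", "cty"),
  num_2 = c("hwy", "hwy"),
  fct   = c("drv", "class"),
  stringsAsFactors = FALSE
)

# pmap() matches the data frame's column names to the function's arguments.
# .data[[name]] tells aes() to use the column whose name is stored in `name`.
plots <- pmap(combos, function(num_1, num_2, fct) {
  ggplot(mpg, aes(.data[[num_1]], .data[[num_2]], colour = .data[[fct]])) +
    geom_point()
})

# Each element of `plots` is a ggplot object; print one to display it.
first_layer <- layer_data(plots[[1]], 1)
```

With this form the strings coming out of pmap() never have to be converted to symbols by hand.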

ggplot par new=TRUE option

I am trying to plot 400 ecdf graphs in one image using ggplot.
As far as I know ggplot does not support the par(new=T) option.
So the first solution I thought was use the grid.arrange function in gridExtra package.
However, the ecdfs I am generating are in a for loop format.
Below is my code, but you could ignore the steps for data processing.
i=1
for(i in 1:400)
{
test<-subset(df,code==temp[i,])
test<-test[c(order(test$Distance)),]
test$AI_ij<-normalize(test$AI_ij)
AI = test$AI_ij
ggplot(test, aes(AI)) +
stat_ecdf(geom = "step") +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
new_theme +
xlab("Calculated Accessibility Value") +
ylab("Percent")
}
So I have values stored in "AI" in the for loop.
In this case how should I plot 400 graphs in the same chart?
This is not the way to put multiple lines on a ggplot. To do this, it is far easier to pass all of your data together and map code to the "group" aesthetic to give you one ecdf line for each code.
By far the hardest part of answering this question was attempting to reverse-engineer your data set. The following data set should be close enough in structure and naming to allow the code to be run on your own data.
library(dplyr)
library(BBmisc)
library(ggplot2)
set.seed(1)
all_codes <- apply(expand.grid(1:16, LETTERS), 1, paste0, collapse = "")
temp <- data.frame(sample(all_codes, 400), stringsAsFactors = FALSE)
df <- data.frame(code = rep(all_codes, 100),
Distance = sqrt(rnorm(41600)^2 + rnorm(41600)^2),
AI_ij = rnorm(41600),
stringsAsFactors = FALSE)
Since you only want the 400 codes from temp that appear in df to be shown on the plot, you can use dplyr::filter to keep the rows where code %in% temp[[1]] rather than iterating through the whole thing one element at a time.
You can then group_by code, and arrange by Distance within each group before normalizing AI_ij, so there is no need to split your data frame into a new subset for every line: the data is processed all at once and the data frame is kept together.
Finally, you plot this using the group aesthetic. Note that because you have 400 lines on one plot, you need to make each line faint in order to see the overall pattern more clearly. We do this by setting the alpha value to 0.05 inside stat_ecdf.
Note also that there are multiple packages with a function called normalize, and I don't know which one you are using. I have guessed you are using BBmisc.
So you can get rid of the loop and do:
df %>%
filter(code %in% temp[[1]]) %>%
group_by(code) %>%
arrange(Distance, .by_group = TRUE) %>%
mutate(AI = normalize(AI_ij)) %>%
ggplot(aes(AI, group = code)) +
stat_ecdf(geom = "step", alpha = 0.05) +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
xlab("Calculated Accessibility Value") +
ylab("Percent")

Use dplyr SE with ggplot2

I often combine dplyr with ggplot2 in wrapper functions for analysis. As I am moving to the new NSE / SE paradigm of v.0.7.1 with tidyeval, I am struggling to get this combination to work. I found that ggplot does not understand unquoted quosures (yet). The following does not work:
example_func <- function(col) {
col <- enquo(col)
mtcars %>% count(!!col) %>%
ggplot(aes((!!col), n)) +
geom_bar(stat = "identity")
}
example_func(cyl)
# Error in !col : invalid argument type
I currently use the following work-around. But I assume there must be a better way.
example_func2 <- function(col) {
col <- enquo(col)
mtcars %>% count(!!col) %>%
ggplot(aes_string(rlang::quo_text(col), "n")) +
geom_bar(stat = "identity")
}
Please show me the best way to combine these two. Thanks!
If you are already handling quosures it's easier to use aes_ which accepts inputs quoted as a formula: aes_(col, ~n).
This bit of code solves your problem:
library(tidyverse)
example_func <- function(col) {
col <- enquo(col)
mtcars %>% count(!!col) %>%
ggplot(aes_(col, ~n)) +
geom_bar(stat = "identity")
}
example_func(cyl)
There seem to be two ways of thinking about this.
Approach 1: Separation of concerns.
I like my plotting stuff to be very much separate from my wrangling stuff. Also, you can name your group, which feels like the easiest way to solve your problem (although you do lose the original column name). So one way of doing what you're trying to do is:
library(tidyverse)
concern1_data <- function(df, col) {
group <- enquo(col)
df %>%
group_by(group = !!group) %>%
summarise(n = n())
}
concern2_plotting <- function(df){
ggplot(data=df) +
geom_bar(aes(group, n), stat = "identity")
}
mtcars %>%
concern1_data(am) %>%
concern2_plotting()
This achieves what you're trying to do more or less and keeps concerns apart (which deserves a mention).
Approach 2: Accept and Wait
Thing is: tidyeval is not yet implemented in ggplot2.
- Colin Fay from link
I think this is support that is currently not in ggplot2 but I can't imagine that ggplot2 won't get this functionality. It's just not there yet.
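For readers on later versions, here is a sketch that assumes ggplot2 3.0.0 or later (where aes() accepts tidy evaluation) and rlang 0.4.0 or later (which provides the curly-curly operator). Under those assumptions the wrapper can forward the column to both dplyr and ggplot2 without enquo() or aes_string():

```r
library(dplyr)
library(ggplot2)

# {{ col }} forwards the unquoted column name to both count() and aes().
# Assumes ggplot2 >= 3.0.0 and rlang >= 0.4.0.
example_func <- function(col) {
  mtcars %>%
    count({{ col }}) %>%
    ggplot(aes({{ col }}, n)) +
    geom_col()  # equivalent to geom_bar(stat = "identity")
}

p <- example_func(cyl)
bars <- layer_data(p, 1)  # one row per distinct value of cyl
```

On older versions, where these features are not available, the aes_() and aes_string() work-arounds above still apply.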

dplyr and ggplot piping is not working as expected

I can't find a solution to the following two issues:
First I try this:
library(tidyverse)
gg <- mtcars %>%
mutate(group=ifelse(gear==3,1,2)) %>%
ggplot(aes(x=carb, y=drat)) + geom_point(shape=group)
Error in layer(data = data, mapping = mapping, stat = stat, geom =
GeomPoint,:object 'group' not found
which is obviously not working. But using something like .$group is also not successful. Of note, I have to specify the shape outside of aes().
The second problem is this. I'm not able to call a saved ggplot (gg) within a pipe.
gg <- mtcars %>%
mutate(group=ifelse(gear==3,1,2)) %>%
ggplot(aes(x=carb, y=drat)) + geom_point()
mtcars %>%
filter(vs == 0) %>%
gg + geom_point(aes(x=carb, y=drat), size = 4)
Error in gg(.) : could not find function "gg"
Thanks for your help!
Edit
After a long time I found a solution here. One has to set the complete ggplot term in {}.
mtcars %>%
mutate(group=ifelse(gear==3,1,2)) %>% {
ggplot(.,aes(carb,drat)) +
geom_point(shape=.$group)}
If you wrap your shape definition in aes() you can get the desired behavior. To use shape outside of aes() you can pass it a single value (i.e. shape=1). Also note that group is converted to a discrete variable, because geom_point throws an error when a continuous variable is mapped to shape.
library(tidyverse)
gg <- mtcars %>%
mutate(group=ifelse(gear==3,1,2)) %>%
ggplot(aes(x=carb, y=drat)) +
geom_point(aes(shape=as.factor(group)))
gg
Second, the %>% operator, when called as lhs %>% rhs, assumes that the rhs is a function. So as the error shows, you are calling gg as a function. Calling a plot as a function on a dataframe (i.e. gg(mtcars)) isn't a valid operation.
See @docendo discimus's comment on the question for how to use {} to add a layer to an existing ggplot object from a magrittr pipeline.
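A minimal sketch of that brace technique, assuming the gg object from the question (rebuilt here so the snippet runs on its own): wrapping the right-hand side in {} stops %>% from treating gg as a function, and the dot inside the braces refers to the piped data.

```r
library(dplyr)
library(ggplot2)

# The saved plot from the question.
gg <- mtcars %>%
  mutate(group = ifelse(gear == 3, 1, 2)) %>%
  ggplot(aes(x = carb, y = drat)) +
  geom_point()

# Inside { }, `.` is the filtered data frame; it is passed explicitly as the
# new layer's data instead of being inserted as the first argument of gg.
p <- mtcars %>%
  filter(vs == 0) %>%
  { gg + geom_point(data = ., size = 4) }

highlighted <- layer_data(p, 2)  # data drawn by the added layer
```

The added layer inherits the x and y mappings from gg, so only the data and the size need to be given.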

Resources