How to combine ggplot and dplyr into a function? - r

Consider this simple example
library(dplyr)
library(ggplot2)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
Here I want to write a function that takes the dataframe and the grouping variable as input. Ideally, after grouping and aggregating, I would like to print a ggplot chart.
This works:
get_charts2 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
> get_charts2(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
Unfortunately, adding ggplot into the function above FAILS
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean, color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
> get_charts1(dataframe, group)
Error in !quo_var : invalid argument type
I don't understand what is wrong here. Any ideas?
Thanks!
EDIT: interesting follow-up here how to create factor variables from quosures in functions using ggplot and dplyr?

ggplot does not yet support tidy eval syntax (you can't use the !!). You need to use more traditional standard evaluation calls. You can use aes_q in ggplot to help with this.
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes_q(x = quote(count), y = quote(mean), color = quo_var, group = quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)

ggplot2 v3.0.0 released in July 2018 supports !! (bang bang), !!!, and :=. aes_()/aes_q() and aes_string() are soft-deprecated.
OP's original code should work
library(tidyverse)
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
Edit: using the tidy evaluation pronoun .data[[ ]] to pick the chosen variable from the data frame also works when the column name is passed as a string
get_charts2 <- function(data, mygroup){
df_agg <- data %>%
group_by(.data[[mygroup]]) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = .data[[mygroup]], group = .data[[mygroup]])) +
geom_point() +
geom_line()
}
get_charts2(dataframe, "group")
Created on 2018-04-04 by the reprex package (v0.2.0).
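With rlang 0.4.0 or later, the enquo() plus !! pair can be written more compactly with the embrace operator {{ }}. The following is a minimal sketch under that assumption, reusing the toy data from the question; the function name get_charts_embrace is made up for illustration and does not appear in the answers above.

```r
library(dplyr)
library(ggplot2)

# Same toy data as in the question
dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Hypothetical helper: {{ mygroup }} captures and unquotes the bare
# column name in one step (assumes rlang >= 0.4.0, ggplot2 >= 3.0.0)
get_charts_embrace <- function(data, mygroup) {
  df_agg <- data %>%
    group_by({{ mygroup }}) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n()) %>%
    ungroup()
  ggplot(df_agg, aes(x = count, y = mean,
                     color = {{ mygroup }}, group = {{ mygroup }})) +
    geom_point() +
    geom_line()
}

p <- get_charts_embrace(dataframe, group)
```

Printing p draws the chart. The call still takes a bare column name, as in the enquo() version, whereas the .data[[ ]] version takes a string.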

Related

How to loop through columns in R to create plots?

I have three columns in a dataframe: age, gender and income.
I want to loop through these columns and create plots based on the data in them.
I know in stata you can loop through variables and then run commands with those variables. However the code below does not seem to work, is there an equivalent way to do what I want to do in R?
groups <- c(df$age, df$gender, df$income)
for (i in groups){
df %>% group_by(i) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = i)) +
geom_col()
}
You can also use the tidyverse. Loop through a vector of grouping variable names with map. On every iteration, convert the name to a symbol and unquote it with !!sym(.x) so that group_by receives the column. Alternatively, use across(all_of()), which takes strings directly as column names. The rest of the code is pretty much the same as yours, except that the x aesthetic must also refer to the current column.
library(dplyr)
library(purrr)
library(ggplot2)
groups <- c('age', 'gender', 'income')
## with !!(sym(.x))
map(groups, ~
df %>% group_by(!!sym(.x)) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = .data[[.x]])) +
geom_col()
)
## with across(all_of())
map(groups, ~
df %>% group_by(across(all_of(.x))) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = .data[[.x]])) +
geom_col()
)
If you want to use a for loop:
groups <- c('age', 'gender', 'income')
for (i in groups){
p <- df %>% group_by(!!sym(i)) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = .data[[i]])) +
geom_col()
print(p) # ggplot objects are not printed automatically inside a for loop
}
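The snippets above assume an existing df. As a self-contained sketch, the same idea can be wrapped in a small helper and run on made-up data; plot_prop and the toy data frame are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(purrr)
library(ggplot2)

# Made-up data with the three columns from the question
set.seed(1)
df <- data.frame(
  age    = sample(c("26-30", "31-35", "36-40"), 20, replace = TRUE),
  gender = sample(c("M", "F"), 20, replace = TRUE),
  income = sample(c("High", "Medium", "Low"), 20, replace = TRUE)
)

# Hypothetical helper: bar chart of the proportion of rows per level of `var`,
# where `var` is a column name given as a string
plot_prop <- function(data, var) {
  data %>%
    group_by(across(all_of(var))) %>%
    summarise(n = n(), .groups = "drop") %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = .data[[var]], y = prop)) +
    geom_col() +
    xlab(var)
}

groups <- c("age", "gender", "income")
# set_names() makes the result a list named by column
plots <- map(set_names(groups), ~ plot_prop(df, .x))
```

Each plot can then be shown with, for example, plots$gender.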
You can use lapply
df <- data.frame(age = sample(c("26-30", "31-35", "36-40", "41-45"), 20, replace = TRUE),
gender = sample(c("M", "F"), 20, replace = TRUE),
income = sample(c("High", "Medium", "Low"), 20, replace = TRUE),
prop = runif(20))
# loop over column names and refer to columns via .data rather than df$ inside aes()
lapply(names(df)[1:3], function(v) ggplot(data = df, aes(y = prop, x = .data[[v]])) + geom_col())

Create all possible geom_col charts of the dataset using rowwise and rlang [closed]

I want to create and save all the possible geom_col charts of a dataset with categorical variables on Y axis and average (grouping variable being y) value of numeric variables on x axis.
I borrowed ideas from this and this link and created the following code, but I am struggling to make it work. Please advise.
library(tidyverse)
library(skimr)
cat_vars <- skim(mpg) %>% as.data.frame() %>% filter(skim_type == "character", character.n_unique <= 16) %>% pull(skim_variable)
num_vars <- skim(mpg) %>% as.data.frame() %>% filter(skim_type == "numeric") %>% pull(skim_variable)
vars <- cross_df(list(y= cat_vars, x = num_vars))
plots <- bind_rows(rep(list(mpg), nrow(vars)), .id = "grp") %>%
nest_by(grp) %>%
bind_cols(vars) %>%
rowwise() %>%
mutate(plot = list(~(data %>%
group_by(y) %>%
summarise("{x}" = mean(.data[[x]], na.rm = T)) %>%
ungroup()) %>%
ggplot() +
geom_col(aes(x = .data[[x]],
y = fct_reorder(.data[[y]], .data[[x]], .fun = sum, na.rm = T),
fill = .data[[y]]), width = 0.8) +
xlab(paste0("Avg. ", x)) +
ylab(y) +
theme_classic()))),
filename = paste0(x, "_by_", y, ".pdf")) %>%
select(filename, plot)
pwalk(plots, ggsave, path = getwd())
There is no need to bind_rows, nest, ... Instead:
Put your plotting code in a helper function instead of wrapping everything in a pipeline. This allows for much easier debugging as well as cleaner and clearer code.
Use map2 to loop over the columns of your df vars
One issue with your code was summarise("{x}" = mean(.data[[x]], na.rm = T)) which will create a variable named {x}. Instead use e.g. !!sym(x) := ... to assign the value back to x if x is a string.
library(tidyverse)
library(skimr)
cat_vars <- skim(mpg) %>% as.data.frame() %>% filter(skim_type == "character", character.n_unique <= 16) %>% pull(skim_variable)
num_vars <- skim(mpg) %>% as.data.frame() %>% filter(skim_type == "numeric") %>% pull(skim_variable)
vars <- cross_df(list(y= cat_vars, x = num_vars))
make_plot <- function(data, x, y) {
data <- data %>%
group_by(across(all_of(y))) %>%
summarise(!!sym(x) := mean(.data[[x]], na.rm = T), .groups = "drop") %>%
ungroup()
ggplot(data) +
geom_col(aes(x = .data[[x]],
y = fct_reorder(.data[[y]], .data[[x]], .fun = sum, na.rm = T),
fill = .data[[y]]), width = 0.8) +
xlab(paste0("Avg. ", x)) +
ylab(y) +
theme_classic()
}
plots <- map2(vars$x, vars$y, make_plot, data = mpg)
length(plots)
#> [1] 25
plots[[1]]
# Export
pwalk(list(x = vars$x, y = vars$y, p = plots), function(x, y, p) ggsave(filename = paste0(x, "_by_", y, ".pdf"), plot = p, path = getwd()))
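The naming issue mentioned in the answer can be checked in isolation. The sketch below assumes dplyr 1.0 or later, where glue-style names such as "{x}" are interpolated only when used with :=, and compares the three spellings on mpg; the result object names are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded only for the mpg dataset

x <- "hwy"
y <- "class"

# 1. Symbol plus := (the form used in make_plot above)
res_sym <- mpg %>%
  group_by(across(all_of(y))) %>%
  summarise(!!sym(x) := mean(.data[[x]], na.rm = TRUE), .groups = "drop")

# 2. Glue-style name plus := (the name is interpolated)
res_glue <- mpg %>%
  group_by(across(all_of(y))) %>%
  summarise("{x}" := mean(.data[[x]], na.rm = TRUE), .groups = "drop")

# 3. Glue-style name plus = (the name is taken literally)
res_eq <- mpg %>%
  group_by(across(all_of(y))) %>%
  summarise("{x}" = mean(.data[[x]], na.rm = TRUE), .groups = "drop")

names(res_sym)
names(res_glue)
names(res_eq)
```

The first two produce a column called hwy, while the third produces a column whose name is the literal string {x}, which is why .data[[x]] could not find it in the plotting step.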

Dynamically fitting two straight lines and getting the crossing point between those lines in R

I have spectral reflectance data like this:
library(hsdar)
library(tidyverse)
##Create some data
parameter <- data.frame(N = seq(1, 1.5, 0.05), LAI = seq(1,6,0.5))
spec <- PROSAIL(parameterList=parameter)
Then I have calculated the 1st order derivative of the data like
d1 <- derivative.speclib(spec)
I have extracted the dataframe from d1 object using following code
d1_df <- d1@spectra@spectra_ma
d1_wav <- d1@wavelength
colnames(d1_df) <- d1_wav
#Plotting of the data
matplot(d1_wav,t(d1_df[1:11,]),type='l', xlim = c(660, 800), ylim=c(-0.01,+0.01), xlab='Wavelength /nm',ylab='Reflectance')
Then I subsetted far-red (680 to 700 nm) and NIR (725 to 760 nm) region like
d1_df %>% as.data.frame() %>%
setNames(paste0("WV_", names(.))) %>%
mutate(ID = seq.int(nrow(.))) %>%
select(281:301, ID) %>%
pivot_longer(cols = -ID) %>%
separate(name, c("chr", "wv"), convert = T)
d1_df %>% as.data.frame() %>%
setNames(paste0("WV_", names(.))) %>%
mutate(ID = seq.int(nrow(.))) %>%
select(326:361, ID) %>%
pivot_longer(cols = -ID) %>%
separate(name, c("chr", "wv"), convert = T)
Now how can I fit two straight lines for the two regions and get the x corresponding to the intersection of these two straight lines for each ID, as shown in the following figure?
There is not a single unique answer to this question, because there is not a unique reflectance line (each ID has its own reflectance line and therefore its own unique crossing point). If we take your subsetted data like this:
region_A <- d1_df %>% as.data.frame() %>%
setNames(paste0("WV_", names(.))) %>%
mutate(ID = seq.int(nrow(.))) %>%
select(290:301, ID) %>%
pivot_longer(cols = -ID) %>%
separate(name, c("chr", "wv"), convert = T) %>%
mutate(ID = factor(ID))
region_B <- d1_df %>% as.data.frame() %>%
setNames(paste0("WV_", names(.))) %>%
mutate(ID = seq.int(nrow(.))) %>%
select(332:350, ID) %>%
pivot_longer(cols = -ID) %>%
separate(name, c("chr", "wv"), convert = T) %>%
mutate(ID = factor(ID))
and plot it, we see:
p <- ggplot(region_A, aes(x = wv, y = value, group = ID)) +
geom_line() +
geom_line(data = region_B)
p
If we extrapolate these lines, we can see they cross at different wavelengths:
p <- p + geom_smooth(method = "lm", formula = y ~ x, fullrange = TRUE,
aes(colour = factor(ID)), se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x, fullrange = TRUE,
data = region_B, aes(colour = factor(ID)), se = FALSE) +
coord_cartesian(ylim = c(0, 0.0125))
p
We can get a linear regression on each of these lines like this:
modA <- lm(value ~ wv * ID, data = region_A)
modB <- lm(value ~ wv * ID, data = region_B)
And we can define a function that returns 0 when the prediction from both models at a particular wavelength is the same like so:
meet_at <- function(X, ID)
{
A <- predict(modA, newdata = list(wv = X, ID = ID))
B <- predict(modB, newdata = list(wv = X, ID = ID))
abs(A - B)
}
This allows us to use the optimize function to find the crossing point of the two fitted lines for each ID and return a nice data frame of results as follows:
df <- do.call(rbind, lapply(unique(region_A$ID), function(i) {
wv <- optimize(meet_at, c(700, 740), ID = i)$minimum
value <- predict(modA, newdata = list(wv = wv, ID = i))
data.frame(wv, value, ID = as.character(i))
}))
df
#> wv value ID
#> 1 708.8861 0.004254394 1
#> 11 710.4923 0.005915650 2
#> 12 712.1372 0.007343448 3
#> 13 713.6095 0.008527553 4
#> 14 714.8414 0.009483770 5
#> 15 715.8220 0.010241372 6
#> 16 716.5676 0.010833544 7
#> 17 717.1078 0.011292029 8
#> 18 717.4764 0.011644701 9
#> 19 717.7071 0.011914912 10
#> 110 717.8309 0.012121712 11
and we know these results are correct if we draw these points on our plot:
p + geom_vline(data = df, aes(xintercept = wv, colour = ID)) +
geom_point(data = df)
So the answer to your question is that the crossing occurs, depending on the ID, at wavelengths ranging between 708 and 718 nm, with the exact details as per df.
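Because both fits are straight lines, the crossing can also be computed in closed form instead of numerically: for y = a1 + b1*x and y = a2 + b2*x the lines meet at x = (a2 - a1) / (b1 - b2). The sketch below is an alternative to the optimize approach, not the method used above, and it runs on synthetic stand-in data for a single ID rather than the hsdar spectra; crossing_x and the two data frames are made up for illustration.

```r
# Hypothetical helper: x-coordinate where two fitted simple regressions cross.
# Assumes each model has exactly an intercept and one slope, and that the
# slopes differ (otherwise the lines are parallel).
crossing_x <- function(fit_a, fit_b) {
  ca <- coef(fit_a)
  cb <- coef(fit_b)
  unname((cb[1] - ca[1]) / (ca[2] - cb[2]))
}

# Synthetic stand-in for one ID: a rising far-red segment and a falling NIR segment
region_a <- data.frame(wv = 690:700)
region_a$value <- 0.0005 * (region_a$wv - 690)
region_b <- data.frame(wv = 730:750)
region_b$value <- 0.02 - 0.0005 * (region_b$wv - 690)

fit_a <- lm(value ~ wv, data = region_a)
fit_b <- lm(value ~ wv, data = region_b)

x_cross <- crossing_x(fit_a, fit_b)
y_cross <- unname(predict(fit_a, newdata = data.frame(wv = x_cross)))
c(wv = x_cross, value = y_cross)
```

For the real data, the same helper could be applied per ID after fitting one lm per ID and region, for example by splitting region_A and region_B on ID.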

How can we wrangle data to obtain the ratio/proportion chart shown?

The goal is to produce a visualization indicating ratios.
How can we produce such a ratio chart (highlighted) in R?
library(tidyverse)
# Dataset creation
df <- data.frame(cls = c(rep("A",4),rep("B",4)),
grd = c("A1",rep("A2",3),rep(c("B1","B2"), 2)),
typ = c(rep("m",2),rep("o",2),"m","n",rep("p",2)),
pnts = c(rep(1:4,2)))
df
#### Data wrangling
df1 <- df %>%
group_by(cls) %>%
summarise(cls_pct = sum(pnts))
df1
df2 <- df %>%
group_by(cls,grd) %>%
summarize(grd_pct = sum(pnts))
df2
df3 <- df %>%
group_by(cls,grd,typ) %>%
summarise(typ_pct = sum(pnts))
df3
#### Attempt to combine all df1,df2,df3
# but mutate and summarise are mixing up leading to wrong results
df3 %>%
group_by(cls,grd) %>%
mutate(grd_pct = sum(typ_pct)) %>%
group_by(cls) %>%
mutate(cls_pct = sum(grd_pct))
Attempt to visualize all the ratios in 1 chart
data %>%
pivot_longer(cols = -c(cls:pnts),
names_to = "per_cat",
values_to = "percent") %>%
ggplot(aes(cls,percent, col = typ, fill = grd)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw()
plot of the same.
EDIT -- added formula version with more useful output for visualization.
ORIG: At this point it may be worth making a function to reduce copying and pasting, but this may get you what you need:
library(tidyverse)
df %>%
group_by(cls) %>%
mutate(per1 = sum(pnts),
per1_pct = per1 / sum(per1)) %>%
group_by(cls, grd) %>%
mutate(per2 = sum(pnts),
per2_pct = per2 / sum(per2)) %>%
group_by(cls, grd, typ) %>%
mutate(per3 = sum(pnts),
per3_pct = per3 / sum(per3)) %>%
ungroup()
EDIT: Here's a general function to calculate the stats for a given grouping, making it easier to combine a few groupings together in long format better suited for visualization.
df_sum <- function(df, level, ...) {
df %>%
group_by(...) %>%
summarize(grp_ttl = sum(pnts)) %>%
mutate(ttl = sum(grp_ttl),
pct = grp_ttl / ttl) %>%
ungroup() %>%
mutate(level = {{ level }} )
}
df_sum(df, level = 1, cls) %>%
bind_rows(df_sum(df, level = 2, cls, grd)) %>%
bind_rows(df_sum(df, level = 3, cls, grd, typ)) %>%
mutate(label = coalesce(as.character(typ), # This grabs the first non-NA
as.character(grd),
as.character(cls))) -> df_summed
df_summed %>%
ggplot(aes(level, grp_ttl)) +
geom_col(color = "white") +
geom_text(aes(label = paste0(label, "\n", grp_ttl, "/", ttl)),
color = "white",
position = position_stack(vjust = 0.5)) +
scale_x_reverse() + # To make level 1 at the top
coord_flip() # To switch from vertical to horizontal orientation

How to use a prior median in ggplot aesthetic in R?

I am trying to calculate a median as one number, and then use that number as a value in the aesthetic of ggplot.
I first try to obtain the median as a value:
mean_delay_median <- nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
median(mean_delay)
This yields the error message:
Error in median.default(., mean_delay) : need numeric data
How can I fix this?
Once I get this working, my second step would be to color a map based on values above and below this median "mean_delay_median", with something like this:
nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(nycflights13::airports, c('dest' = 'faa')) %>%
ggplot(aes(lon, lat, color=mean_delay>mean_delay_median)) +
borders("state") +
geom_point() +
coord_quickmap()
In general, I seek guidance on using prior statistics in subsequent code.
Thanks!
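Piping the summarised tibble into median() passes the whole data frame as its first argument, which is why it complains about needing numeric data. You are just missing a second step, summarise(median_all_delay = median(mean_delay, na.rm = TRUE)).
Try this: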
mean_delay_median <- nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
summarise(median_all_delay = median(mean_delay, na.rm = TRUE)) %>%
unlist()
nycflights13::flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(nycflights13::airports, c('dest' = 'faa')) %>%
ggplot(aes(lon, lat, color=mean_delay>mean_delay_median)) +
borders("state") +
geom_point() +
coord_quickmap()
output is:
You should note that one destination(dest) has every arr_delay observation missing.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
filter(all(is.na(arr_delay))) %>%
select(dest, arr_delay)
#> # A tibble: 1 x 2
#> # Groups: dest [1]
#> dest arr_delay
#> <chr> <dbl>
#> 1 LGA NA
This leads to NaN, not zero.
mean(c(NA), na.rm = TRUE)
#> [1] NaN
In other words, you should also pass na.rm = TRUE to the median function.
flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
mutate(arrival = ifelse(mean_delay > median(mean_delay, na.rm = TRUE), "late", "okay")) %>% # na.rm option to median
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot() +
aes(lon, lat, colour = arrival) +
borders("state") +
geom_point() +
coord_quickmap()
Since the mean delay for LGA is NaN, the comparison with the median returns NA, so its arrival label will be NA rather than "late" or "okay".
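On the general question of reusing a previously computed statistic: one common pattern is to extract the column with pull(), reduce it to a plain number, and then refer to that number in later code. The sketch below uses a tiny made-up table in place of nycflights13::flights so that it runs without that package; the object names delays and flagged are illustrative.

```r
library(dplyr)

# Made-up stand-in for flights: destination B has only missing delays
delays <- tibble(dest = c("A", "A", "B", "C", "C"),
                 arr_delay = c(10, 20, NA, -5, 5))

mean_delays <- delays %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE))

# pull() returns the column as a plain vector, so median() gets numeric input;
# na.rm = TRUE also drops the NaN produced for destination B
mean_delay_median <- mean_delays %>%
  pull(mean_delay) %>%
  median(na.rm = TRUE)

# The stored scalar can now be used in any later expression, including aes()
flagged <- mean_delays %>%
  mutate(late = mean_delay > mean_delay_median)
flagged
```

Here mean_delay_median is a bare number rather than a one-cell tibble, so it can be dropped into aes(color = mean_delay > mean_delay_median) as in the question.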
