lapply on a list of data frames to create plots - r

I have created a list of data frames, df_list, each of which is an 18 x 4 data frame.
The first column of each data frame is the gene name repeated 18 times, and the remaining three columns hold that gene's information. Each data frame describes a different gene.
Now I'd like to apply a boxplot function to each data frame in the list (i.e., to each gene and its information) to get one plot per gene; however, I am not sure how to deal with the ggtitle below.
Here is my simplified boxplot function:
box <- function(df){
df %>%
ggplot(df, aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
g is the gene name in each data frame in df_list.
When I run lapply(df_list, box), I get Error: Mapping should be created with aes() or aes_().
Does anyone know how to fix this? Thank you.

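Because you pipe df into ggplot() with %>%, the first argument (data) is already filled, so the second df is taken as the mapping argument. That is why ggplot complains that the mapping was not created with aes(). Drop the extra df: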
box <- function(df){
df %>%
ggplot(aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}

Instead of using df[,4] or df[,2], use column names in aes(). Assuming the x-axis column is named col1 and the y-axis column is named col2, try:
box <- function(df){
  df %>%
    ggplot(aes(x = col1, y = col2)) +
    geom_boxplot() +
    ggtitle(df$g[1])
}
lapply(df_list, box)
Here g is assumed to be the name of the gene-name column in each data frame, so we can take its first value as the title.
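A minimal, self-contained sketch of this approach on made-up data may help. The column names gene, expr and condition, the gene names, and the helper make_gene_df() are all hypothetical stand-ins for the real data (and the frames are simplified to three columns); only the pattern of taking the title from the first value of the gene column comes from the answer.

```r
library(ggplot2)

# Hypothetical stand-in for one 18-row gene data frame
make_gene_df <- function(gene) {
  data.frame(
    gene      = rep(gene, 18),                          # gene name repeated 18 times
    expr      = seq_len(18) / 2,                        # made-up measurements
    condition = rep(c("ctrl", "low", "high"), each = 6) # made-up grouping
  )
}

df_list <- lapply(c("geneA", "geneB"), make_gene_df)

box <- function(df) {
  ggplot(df, aes(x = condition, y = expr)) +
    geom_boxplot() +
    ggtitle(df$gene[1])  # the gene name is constant within a data frame
}

# One ggplot object per data frame
plots <- lapply(df_list, box)
```

Printing plots[[1]] would then draw the first gene's boxplot with its name as the title.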

Here is how to do it with another example dataset (Species instead of genes):
library(tidyverse)
plots <-
  iris %>%
  pivot_longer(-Species) %>%
  nest(data = -Species) %>%   # name the nested column to avoid the tidyr warning
  mutate(
    plt = data %>% map2(Species, ~ {
      .x %>%
        ggplot(aes(name, value)) +
        geom_boxplot() +
        labs(title = .y)
    })
  ) %>%
  pull(plt)
plots[[1]]
Created on 2021-09-09 by the reprex package (v2.0.1)

Related

Extract ggplot from a nested dataframe

I have created a set of ggplots using a grouped dataframe and the map function and I would like to extract the plots to be able to manipulate them individually.
library(tidyverse)
plot <- function(df, title){
df %>% ggplot(aes(class)) +
geom_bar() +
labs(title = title)
}
plots <- mpg %>% group_by(manufacturer) %>% nest() %>%
mutate(plots= map(.x=data, ~plot(.x, manufacturer)))
nissan <- plots %>% filter(manufacturer == "nissan") %>% pull(plots)
nissan
nissan + labs(title = "Nissan")
In this case, "nissan" is a list object and I am not able to manipulate it. How do I extract the ggplot?
In terms of data structures, I think keeping a tibble (or data.frame) is suboptimal for the usage shown. If you have one plot per manufacturer and plan to access the plots by manufacturer, I would recommend using transmute() and then deframe() to get a named list.
That is, I would find it conceptually clearer to do something like:
library(tidyverse)
plot <- function(df, title){
df %>% ggplot(aes(class)) +
geom_bar() +
labs(title = title)
}
plots <- mpg %>%
group_by(manufacturer) %>% nest() %>%
transmute(plot=map(.x=data, ~plot(.x, manufacturer))) %>%
deframe()
plots[['nissan']]
plots[['nissan']] + labs(title = "Nissan")
Otherwise, if you want to keep the tibble, another option, similar to what has been suggested in the comments, is to call first() after pull().
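As a rough sketch of that second option: pull() on a list-column returns a list, here of length one, so first() (or [[1]]) is what unwraps the actual ggplot object. The helper is renamed to bar_by_class() only to avoid masking base::plot(); that name is an arbitrary choice, not part of the question.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

# Same plotting helper as in the question, under a non-clashing (arbitrary) name
bar_by_class <- function(df, title) {
  ggplot(df, aes(class)) +
    geom_bar() +
    labs(title = title)
}

plots <- mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate(plots = map(data, ~ bar_by_class(.x, manufacturer)))

# pull() gives a one-element list; first() extracts the plot itself
nissan <- plots %>%
  filter(manufacturer == "nissan") %>%
  pull(plots) %>%
  first()

# The extracted object can now be modified like any other ggplot
nissan + labs(title = "Nissan")
```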

Plot percentages in R as blocks

I have the table on the left
table <- cbind(c("x1","x2", "x3"), c("0.4173","0.9211","0.0109"))
and am trying to make the plot on the right.
Are there any packages in R that can do what I'm trying to achieve?
A base R option would be to use barplot() on a named vector
barplot(v1)
Or convert to two column data.frame with stack and use the formula method
barplot(values ~ ind, stack(v1))
Or we can use the tidyverse with ggplot2
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
enframe(v1, name = "id", value = 'block') %>%
mutate(non_block = 1 - block) %>%
pivot_longer(cols = -id) %>%
ggplot(aes(x = id, y = value, fill = name)) +
geom_col() +
coord_flip() +
theme_bw()
data
v1 <- setNames(c(0.4173, 0.9211, 0.0109), paste0("x", 1:3))
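The named vector v1 above is typed in by hand. If the values start out in the character matrix from the question, a sketch of the conversion could look like this (tbl is just a renamed copy of the question's table, chosen here to avoid masking base::table()):

```r
# The table from the question: a character matrix built with cbind()
tbl <- cbind(c("x1", "x2", "x3"), c("0.4173", "0.9211", "0.0109"))

# Column 2 holds the values as strings, column 1 holds the labels
v1 <- setNames(as.numeric(tbl[, 2]), tbl[, 1])
v1
```

The result is a named numeric vector that can be passed to barplot() or enframe() as in the answer.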

ifelse condition: is in top n

Usually when I need to label a subset of points with geom_label(), I use ifelse() and specify a number, as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
geom_point() +
geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?
For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
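To make the direction and tie handling concrete, here is a small sketch on made-up vectors (the values and names are invented for illustration). It relies on rank() keeping NA as NA by default (na.last = "keep") and uses ties.method = "min" so that tied values share the better rank.

```r
# Made-up values; the NA mimics a missing birth_year
birth_year <- c(10, 50, 30, NA, 50)
name       <- c("a", "b", "c", "d", "e")

# Rank in descending order: the largest value gets rank 1,
# ties share the smallest rank, and NA stays NA
r <- rank(-birth_year, ties.method = "min")

# Keep labels only for the top 2 values (both tied 50s qualify)
label <- ifelse(r <= 2, name, NA)
label
```

With the default ties.method = "average", the two tied values would each get rank 1.5 instead, which is worth keeping in mind when choosing the cutoff.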
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
ggplot(aes(x = mass, y = height, label = name)) +
geom_point() +
geom_label(
data = data %>%
group_by(species) %>%
top_n(n = 1, wt = desc(birth_year)) # youngest of each species
)
Something like this? To get top 4 values.
library(ggplot2)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
geom_point() +
geom_label()
This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
drop_na(birth_year) %>%
count(birth_year, sort = TRUE) %>%
top_n(4) %>%
pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
ggplot(aes(x = mass, y = height, label = label)) +
geom_point() +
geom_label(na.rm = TRUE)

Set ggplot title to reflect dplyr grouping

I've got a grouped dataframe generated in dplyr where each group reflects a unique combination of factor variable levels. I'd like to plot the different groups using code similar to this post. However, I can't figure out how to include two (or more) variables in the title of my plots, which is a hassle since I've got a bunch of different combinations.
Fake data and plotting code:
library(dplyr)
library(ggplot2)
spiris<-iris
spiris$site<-as.factor(rep(c("A","B","C")))
spiris$year<-as.factor(rep(2012:2016))
spiris$treatment<-as.factor(rep(1:2))
g<-spiris %>%
group_by(site, Species) %>%
do(plots=ggplot(data=.) +
aes(x=Petal.Width)+geom_histogram()+
facet_grid(treatment~year))
##Need code for title here
g[[3]] ##view plots
I need the title of each plot to reflect both "site" and "Species". Any ideas?
Use split() %>% purrr::map2() instead of group_by() %>% do() like this:
spiris %>%
split(list(.$site, .$Species)) %>%
purrr::map2(.y = names(.),
~ ggplot(data=., aes(x=Petal.Width)) +
geom_histogram()+
facet_grid(treatment~year) +
labs(title = .y) )
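The titles in this approach come from the names that split() gives each piece. A base R sketch of what those names look like, and one possible way to control the separator through an explicit interaction() (a variation, not part of the answer above):

```r
# Same fake grouping column as in the question, recycled across all rows
spiris <- iris
spiris$site <- factor(rep(c("A", "B", "C"), length.out = nrow(spiris)))

# split() on a list of factors names each piece "<site>.<Species>"
pieces <- split(spiris, list(spiris$site, spiris$Species))
names(pieces)[1:3]

# An explicit interaction() lets you choose the separator used in the names
pieces2 <- split(spiris, interaction(spiris$site, spiris$Species, sep = " - "))
names(pieces2)[1:3]
```

Passing names(pieces2) as .y to map2() would then produce titles such as "A - setosa".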
You just need to set the title with ggtitle():
g <- spiris %>%
  group_by(site, Species) %>%
  do(plots = ggplot(data = .) +
       aes(x = Petal.Width) +
       geom_histogram() +
       facet_grid(treatment ~ year) +
       # site and Species are constant within a group, so the first value is enough
       ggtitle(paste(.$Species[1], .$site[1], sep = " - ")))

is the merging of data frames necessary here

I have a data frame and I would like to plot 3 lines, all from the "Value" vector. The first two lines are the value vector grouped by "group", and the third line is the ungrouped value vector. The way I am currently doing it is by making 2 calls to dplyr to create 2 data frames, then merging them and plotting the merged data frame. Is there an easier way that avoids 2 calls to dplyr?
d = data.frame(ym = rep(c(20011,20012,20023),3), group = c(0,0,1,0,1,0,1,0,1), value = c(1,2,3,4,2,1,3,3,2))
############### 1st call to dplyr to create plot with 2 lines grouped by "group"
d2 = d %>%
group_by(ym,group) %>%
summarise(
Value = mean(value)
)
d2= as.data.frame(d2)
d2
ggplot(data=d2 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()
###second call to dplyr to create a second data frame just for the UNGROUPED data
d3 = d %>%
group_by(ym) %>%
summarise(
Value = mean(value)
)
#### merge the TWO data frames
d3 =as.data.frame(d3)
d3$group=2
d4 = rbind(d2,d3)
### plot all 3 lines
ggplot(data=d4 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()
You could do it in a single dplyr chain, but (AFAIK) it still requires two separate operations:
d2 = bind_rows(
d %>%
group_by(ym, group=as.character(group)) %>%
summarise(Value = mean(value)),
d %>%
group_by(ym) %>%
summarise(Value = mean(value),
group = "All"))
The code group=as.character(group) is necessary to avoid an error when you add group="All", because bind_rows won't automatically coerce group from numeric to character. (This step is of course unnecessary in cases where the grouping column is already factor or character.)
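A small sketch of that type mismatch, using two made-up one-row tibbles (num_grp and chr_grp are hypothetical names): binding a numeric group column to a character one fails, while converting first lets the rows stack.

```r
library(dplyr)

num_grp <- tibble(ym = 1, group = 0,     Value = 1.5)  # numeric group column
chr_grp <- tibble(ym = 1, group = "All", Value = 2.0)  # character group column

# Mixing a numeric and a character column of the same name raises an error,
# which is caught here and replaced by a marker string
res <- tryCatch(bind_rows(num_grp, chr_grp), error = function(e) "error")

# Converting the numeric column to character first makes the bind work
ok <- bind_rows(mutate(num_grp, group = as.character(group)), chr_grp)
ok$group
```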
Then, for plotting you can highlight the average line so that it's separate from the individual groups. We map to shape solely to be able to remove the point markers for the All line:
ggplot(d2 , aes(x=ym, y=Value, colour=group)) +
geom_line(aes(size=group)) +
geom_point(aes(shape=group)) +
scale_color_manual(values=c(hcl(c(15,195),100,65), "black")) +
scale_shape_manual(values=c(16,16,NA)) +
scale_size_manual(values=c(0.7,0.7,1.5))
