Is merging the data frames necessary here? - r

I have a data frame and I would like to plot three lines, all from the "value" column. The first two lines are the value column grouped by "group", and the third line is the ungrouped value column. I currently make two calls to dplyr to create two data frames, merge them, and then plot the merged data frame. Is there an easier way that avoids two calls to dplyr?
library(dplyr)
library(ggplot2)
d = data.frame(ym = rep(c(20011,20012,20023),3), group = c(0,0,1,0,1,0,1,0,1), value = c(1,2,3,4,2,1,3,3,2))
############### 1st call to dplyr to create plot with 2 lines grouped by "group"
d2 = d %>%
group_by(ym,group) %>%
summarise(
Value = mean(value)
)
d2= as.data.frame(d2)
d2
ggplot(data=d2 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()
###second call to dplyr to create a second data frame just for the UNGROUPED data
d3 = d %>%
group_by(ym) %>%
summarise(
Value = mean(value)
)
#### merge the TWO data frames
d3 =as.data.frame(d3)
d3$group=2
d4 = rbind(d2,d3)
### plot all 3 lines
ggplot(data=d4 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()

You could do it in a single dplyr chain, but (AFAIK) it still requires two separate operations:
d2 = bind_rows(
d %>%
group_by(ym, group=as.character(group)) %>%
summarise(Value = mean(value)),
d %>%
group_by(ym) %>%
summarise(Value = mean(value),
group = "All"))
The code group=as.character(group) is necessary to avoid an error when you add group="All", because bind_rows won't automatically coerce group from numeric to character. (This step is of course unnecessary in cases where the grouping column is already factor or character.)
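The coercion point can be checked on the question's own data. This is a minimal sketch, not part of the original answer: the names by_group, overall, attempt and combined are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Data frame from the question
d <- data.frame(ym = rep(c(20011, 20012, 20023), 3),
                group = c(0, 0, 1, 0, 1, 0, 1, 0, 1),
                value = c(1, 2, 3, 4, 2, 1, 3, 3, 2))

# Per-group means: `group` stays numeric here
by_group <- d %>%
  group_by(ym, group) %>%
  summarise(Value = mean(value), .groups = "drop")

# Overall means: `group` is the character label "All"
overall <- d %>%
  group_by(ym) %>%
  summarise(Value = mean(value)) %>%
  mutate(group = "All")

# Stacking a numeric column on a character column fails in bind_rows()
attempt <- tryCatch(bind_rows(by_group, overall),
                    error = function(e) "error")

# Converting `group` to character first lets the two frames stack
combined <- bind_rows(by_group %>% mutate(group = as.character(group)),
                      overall)
```

Here attempt holds the string "error" rather than a data frame, while combined has one row per (ym, group) pair plus one "All" row per ym.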
Then, for plotting you can highlight the average line so that it's separate from the individual groups. We map to shape solely to be able to remove the point markers for the All line:
ggplot(d2 , aes(x=ym, y=Value, colour=group)) +
geom_line(aes(size=group)) +
geom_point(aes(shape=group)) +
scale_color_manual(values=c(hcl(c(15,195),100,65), "black")) +
scale_shape_manual(values=c(16,16,NA)) +
scale_size_manual(values=c(0.7,0.7,1.5))

Related

lapply on a list of dataframe to create plots

I have created a list of data frames, df_list, each of which is an 18 x 4 data frame.
The first column of each data frame is the gene name repeated 18 times, and the remaining three columns hold the gene's information. Each data frame describes a different gene.
Now I'd like to apply a boxplot function to every data frame in the list (i.e., a list of genes and their respective information) to get one plot per gene; however, I am not sure how to deal with the ggtitle below:
Here is my simplified boxplot function:
box <- function(df){
df %>%
ggplot(df, aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
g is the gene name in each data frame in df_list.
When I run lapply(df_list, box),
I get Error: Mapping should be created with `aes()` or `aes_()`.
Does anyone know how to fix this? Thank you.
Because you pipe df into ggplot(), its first argument (data) is already filled, so the extra df is interpreted as the mapping argument, which must be created with aes(). Drop the extra df:
box <- function(df){
df %>%
ggplot(aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
Instead of using df[,4] or df[,2], use column names in aes(). Assuming the column for the x-axis is named col1 and the one for the y-axis is named col2, try:
box <- function(df){
df %>%
ggplot(aes(x = col1, y = col2)) +
geom_boxplot() +
ggtitle(df$g[1])
}
lapply(df_list,box)
Here g is assumed to be the name of the gene column in each data frame, so its first value can be used as the title.
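If the column names are not known in advance, the same idea can be sketched by looking columns up by position. This is an illustration under assumptions, not the answerer's code: make_gene_df and the toy columns expr, replicate and condition are hypothetical stand-ins for the real df_list, and the .data pronoun assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Hypothetical stand-in for df_list: two 18 x 4 data frames, one per gene
make_gene_df <- function(gene) {
  data.frame(
    gene      = rep(gene, 18),
    expr      = seq_len(18),
    replicate = rep(1:6, times = 3),
    condition = rep(c("a", "b", "c"), each = 6)
  )
}
df_list <- lapply(c("geneA", "geneB"), make_gene_df)

box <- function(df) {
  x_col <- names(df)[4]  # fourth column on the x-axis
  y_col <- names(df)[2]  # second column on the y-axis
  ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_boxplot() +
    ggtitle(as.character(df[[1]][1]))  # gene name from the first column
}

plots <- lapply(df_list, box)
```

Each element of plots is a separate ggplot object titled with the gene name taken from that data frame's first column.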
This is how to do it with another example dataset (Species instead of genes):
library(tidyverse)
plots <-
iris %>%
pivot_longer(-Species) %>%
nest(data = -Species) %>%
mutate(
plt = data %>% map2(Species, ~ {
.x %>%
ggplot(aes(name, value)) +
geom_boxplot() +
labs(title = .y)
})
) %>%
pull(plt)
plots[[1]]
Created on 2021-09-09 by the reprex package (v2.0.1)

Splitting a dataframe by every n unique values of a variable

I have a dataframe of Lots, Time, Value with the same structure as the sample data below.
df <- tibble(Lot = c(rep(123,4),rep(265,5),rep(132,3),rep(455,4)),
time = c(seq(4), seq(5), seq(3), seq(4)), Value = runif(16))
I'd like to split the dataframe by every N Lots and plot them. The Lots are different sizes so I can't subset the data by every n rows!
I've been using an approach like this but it's not scalable for a large dataset.
df %>% filter(Lot %in% c(123, 265)) %>% ggplot(aes(x = time, y = Value)) +
geom_point() + stat_smooth()
How can I do this?
Create a lot number column and create a list of plots for every n unique lot values.
This gives you a list of plots.
library(tidyverse)
lot_n <- 2
df %>%
mutate(Lot_number = match(Lot, unique(Lot)),
group = ceiling(Lot_number/lot_n)) %>%
group_split(group) %>%
map(~ggplot(.x, aes(x = time, y = Value)) +
geom_point() + stat_smooth()) -> list_plots
list_plots
Individual plots can be accessed via list_plots[[1]], list_plots[[2]] etc.
You can also plot the data with facets.
df %>%
mutate(Lot_number = match(Lot, unique(Lot)),
group = ceiling(Lot_number/lot_n)) %>%
ggplot(aes(x = time, y = Value)) +
geom_point() + stat_smooth() +
facet_wrap(~group, scales = 'free')
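To see what the Lot_number and group columns contain before plotting, the mutate() step can be run on its own. A small sketch using the question's sample data; the set.seed call and the names grouped and lots are additions for illustration.

```r
library(dplyr)

set.seed(1)  # only so the random Value column is reproducible
df <- tibble(Lot = c(rep(123, 4), rep(265, 5), rep(132, 3), rep(455, 4)),
             time = c(seq(4), seq(5), seq(3), seq(4)),
             Value = runif(16))

lot_n <- 2

grouped <- df %>%
  mutate(Lot_number = match(Lot, unique(Lot)),    # order of first appearance
         group = ceiling(Lot_number / lot_n))     # every lot_n lots share a group

# One row per lot, showing which plot group it falls into
lots <- distinct(grouped, Lot, Lot_number, group)
lots
```

Lots are numbered in order of first appearance, so 123 and 265 land in group 1 and 132 and 455 in group 2, regardless of how many rows each lot has.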

ifelse condition: is in top n

Usually when I need to label only a subset of points with geom_label(), I use ifelse() and specify a number, as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
geom_point() +
geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?
For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
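How ties and missing values affect the result is easiest to see on a toy vector. The sketch below uses made-up values (not starwars) and picks ties.method = "min" and na.last = "keep" as one reasonable choice among those documented in ?rank.

```r
# Toy values: one NA and a tie for the largest value
birth_year <- c(5, NA, 20, 10, 20)
name <- c("a", "b", "c", "d", "e")
n <- 2

# Rank from largest to smallest; tied values share the lowest rank,
# and NA inputs keep an NA rank instead of being ranked last
r <- rank(-birth_year, ties.method = "min", na.last = "keep")

# Label only the top n; NA ranks propagate to NA labels
label <- ifelse(r <= n, name, NA)
label
```

Both tied values receive rank 1, so "c" and "e" are labelled and everything else, including the entry with a missing birth_year, becomes NA.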
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
ggplot(aes(x = mass, y = height, label = name)) +
geom_point() +
geom_label(
data = data %>%
group_by(species) %>%
top_n(n = 1, wt = desc(birth_year)) # youngest of each species
)
Something like this? To get top 4 values.
library(ggplot2)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
geom_point() +
geom_label()
This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
drop_na(birth_year) %>%
count(birth_year, sort = TRUE) %>%
top_n(4) %>%
pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
ggplot(aes(x = mass, y = height, label = label)) +
geom_point() +
geom_label(na.rm = TRUE)

How to group a single column based on two variables in r

I have a data frame with four columns in the following format:
column 1: 0 Yellow (8 observations)
column 2: 0 Purple (9 observations)
column 3: 6 Yellow (11 observations)
column 4: 6 Purple (12 observations)
Yellow_0 <- c(2,5,6,2,6,4,35,6,NA,NA,NA,NA)
Purple_0 <- c(12,34,34,54,23,33,2,12,23,NA,NA,NA)
Yellow_6 <- c(31,23,4,5,56,43,18,33,5,23,33,NA)
Purple_6 <- c(23,5,23,33,45,66,12,23,2,2,23,24)
I want to group by both time (0 and 6) and colour (Yellow and Purple). I tried the following code (after importing the CSV file), but it groups the variables by time and colour combined (4 groups in total) instead of grouping by time (two groups) and colour (two groups).
library(tidyverse)
library(reshape2)
DF <- melt(DF, na.rm = TRUE)
DF <- DF %>% group_by(variable)
a <- ggplot(DF, aes(x=variable, y= value, colour = variable)) + geom_boxplot()
How do I perform "group_by" function on both Time and Colour?
We could reshape to 'long' format and then separate the column names into two columns:
library(dplyr)
library(tidyr)
library(ggplot2)
DF %>%
pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
separate(name, into = c('color', 'time')) %>%
ggplot(aes(x = time, y = value, fill = color)) +
geom_boxplot() +
facet_wrap(~ color)
data
DF <- data.frame(Yellow_0, Purple_0, Yellow_6, Purple_6)
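It can help to inspect the reshaped data before it is piped into ggplot. A short sketch of just the reshaping step on the question's vectors; the name long is made up for illustration.

```r
library(dplyr)
library(tidyr)

Yellow_0 <- c(2, 5, 6, 2, 6, 4, 35, 6, NA, NA, NA, NA)
Purple_0 <- c(12, 34, 34, 54, 23, 33, 2, 12, 23, NA, NA, NA)
Yellow_6 <- c(31, 23, 4, 5, 56, 43, 18, 33, 5, 23, 33, NA)
Purple_6 <- c(23, 5, 23, 33, 45, 66, 12, 23, 2, 2, 23, 24)
DF <- data.frame(Yellow_0, Purple_0, Yellow_6, Purple_6)

long <- DF %>%
  pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  separate(name, into = c("color", "time"))  # splits "Yellow_0" at the underscore

# Observations per colour and time
count(long, color, time)
```

After this step color and time are separate character columns, so either can be mapped to x, fill or a facet independently.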

R: using ggplot2 with a group_by data set

I can't quite figure this out. I have a CSV of 200+ rows, read into data, like so:
gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937
Trying to group by p1_id and plot the mean values for p1_x and p1_y:
grp <- data %>% group_by(p1_id)
Trying to plot geom_point objects like so:
geom_point(aes(mean(grp$p1_x), mean(grp$p1_y), color=grp$p1_id))
But that isn't showing unique plot points per distinct p1_id values.
What's the missing step here?
Why not calculate the mean first?
library(dplyr)
grp <- data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y))
Then plot:
library(ggplot2)
ggplot(grp, aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))
Edit: As per @eipi10, you can also pipe directly into ggplot:
data %>%
group_by(p1_id) %>%
summarise(mean_p1x = mean(p1_x),
mean_p1y = mean(p1_y)) %>%
ggplot(aes(x = mean_p1x, y = mean_p1y)) +
geom_point(aes(color = as.factor(p1_id)))
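The summarise step can be checked against the four sample rows from the question. A minimal sketch, assuming the CSV is read with read.csv; only the rows shown in the question are used, not the full 200+ row file.

```r
library(dplyr)

# The four sample rows from the question
data <- read.csv(text = "gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937")

# One row per p1_id, holding the mean position
grp <- data %>%
  group_by(p1_id) %>%
  summarise(mean_p1x = mean(p1_x),
            mean_p1y = mean(p1_y))
grp
```

The result has one row per distinct p1_id, which is what gives one plotted point per id. Calling mean(grp$p1_x) inside aes() instead collapses the whole column to a single number.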

Resources