How to group a single column based on two variables in R

I have a data frame with four columns in the following format:
column 1: 0 Yellow (8 observations)
column 2: 0 Purple (9 observations)
column 3: 6 Yellow (11 observations)
column 4: 6 Purple (12 observations)
Yellow_0 <- c(2,5,6,2,6,4,35,6,NA,NA,NA,NA)
Purple_0 <- c(12,34,34,54,23,33,2,12,23,NA,NA,NA)
Yellow_6 <- c(31,23,4,5,56,43,18,33,5,23,33,NA)
Purple_6 <- c(23,5,23,33,45,66,12,23,2,2,23,24)
I want to group by both time (0 and 6) and colour (Yellow and Purple). After importing the CSV file I tried the following code, but it treats each column as its own group (4 groups in total) instead of grouping by time (two groups) and colour (two groups).
library(tidyverse)
library(reshape2)
DF <- melt(DF, na.rm = TRUE)
DF <- DF %>% group_by(variable)
a <- ggplot(DF, aes(x=variable, y= value, colour = variable)) + geom_boxplot()
How do I perform "group_by" function on both Time and Colour?

We could separate the column names into two after reshaping to 'long' format
library(dplyr)
library(tidyr)
library(ggplot2)
DF %>%
  pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  separate(name, into = c('color', 'time')) %>%
  ggplot(aes(x = time, y = value, fill = color)) +
  geom_boxplot() +
  facet_wrap(~ color)
data
DF <- data.frame(Yellow_0, Purple_0, Yellow_6, Purple_6)
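If the goal is a grouped summary rather than a plot, the same reshaping can feed group_by() with both variables. A minimal sketch, assuming the four vectors from the question; the summary column names n and mean_value are chosen here for illustration only:

```r
library(dplyr)
library(tidyr)

# Data from the question
Yellow_0 <- c(2, 5, 6, 2, 6, 4, 35, 6, NA, NA, NA, NA)
Purple_0 <- c(12, 34, 34, 54, 23, 33, 2, 12, 23, NA, NA, NA)
Yellow_6 <- c(31, 23, 4, 5, 56, 43, 18, 33, 5, 23, 33, NA)
Purple_6 <- c(23, 5, 23, 33, 45, 66, 12, 23, 2, 2, 23, 24)
DF <- data.frame(Yellow_0, Purple_0, Yellow_6, Purple_6)

summary_df <- DF %>%
  # one row per observation, dropping the NA padding
  pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  # "Yellow_0" becomes color = "Yellow", time = "0"
  separate(name, into = c("color", "time")) %>%
  # group by both variables at once
  group_by(color, time) %>%
  summarise(n = n(), mean_value = mean(value), .groups = "drop")

summary_df
```

Grouping by only one of the two variables, for example group_by(time), would give the two-group view instead.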

Related

lapply on a list of dataframe to create plots

I have created a list of dataframe df_list, each of which is an 18 x 4 dataframe.
The first columns of the dataframe is 18-times-repeated gene name, and the rest three columns are the gene's information. Each dataframe describes different gene.
Now I'd like to iterate the list of dataframe (i.e, a list of gene and their respective information) over the boxplot, to get plots on each gene; however, I am not sure how to deal with the ggtitle below:
Here is my simplified boxplot function:
box <- function(df){
df %>%
ggplot(df, aes(x = df[,4], y = df[,2])) +
geom_boxplot() +
ggtitle(g)
}
g is the gene name in each dataframe in df_list. When I run lapply(df_list, box), I get:
Error: Mapping should be created with aes() or aes_().
Does anyone know how to fix this? Thank you.
Because df is piped into ggplot(), the data argument is already filled, so the df passed inside the call is interpreted as the mapping argument. Drop it:
box <- function(df){
  df %>%
    ggplot(aes(x = df[,4], y = df[,2])) +
    geom_boxplot() +
    ggtitle(g)
}
Instead of df[,4] or df[,2], use column names in aes(). Assuming the x-axis column is named col1 and the y-axis column is named col2, try:
box <- function(df){
  df %>%
    ggplot(aes(x = col1, y = col2)) +
    geom_boxplot() +
    ggtitle(df$g[1])
}
lapply(df_list, box)
g is the column name in the dataframe, so we can take the first value from it in title.
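To see the pattern end to end, here is a small self-contained sketch. The toy data, the helper make_gene_df, and the column names g, col1, and col2 are assumptions made for illustration; the real df_list will have its own names.

```r
library(ggplot2)

# Hypothetical helper: builds an 18-row data frame for one gene
make_gene_df <- function(gene) {
  data.frame(
    g    = rep(gene, 18),                        # repeated gene name
    col2 = seq_len(18) / 2,                      # values for the y-axis
    col1 = rep(c("ctrl", "treated"), each = 9)   # groups for the x-axis
  )
}

df_list <- list(make_gene_df("geneA"), make_gene_df("geneB"))

# One boxplot per data frame, titled with that data frame's gene name
box <- function(df) {
  ggplot(df, aes(x = col1, y = col2)) +
    geom_boxplot() +
    ggtitle(df$g[1])
}

plots <- lapply(df_list, box)
```

Each element of plots is a ggplot object, so plots[[1]] draws the first gene's boxplot.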
This is how to do it with another example dataset (Species instead of genes):
library(tidyverse)
plots <-
  iris %>%
  pivot_longer(-Species) %>%
  # name the nested column explicitly to avoid the "must be named" warning
  nest(data = -Species) %>%
  mutate(
    plt = data %>% map2(Species, ~ {
      .x %>%
        ggplot(aes(name, value)) +
        geom_boxplot() +
        labs(title = .y)
    })
  ) %>%
  pull(plt)
plots[[1]]
Created on 2021-09-09 by the reprex package (v2.0.1)

summarise column and add its common values in R

I have a dataframe something like this and my end goal is to make a bar chart.
Here is the data frame.
a 5
a 7
b 23
b 12
c 21
c 21
c 27
I want to group the data frame by the first column, sum the values of the second column within each group, and make a bar chart of those sums. The resulting data frame should be:
a 12
b 35
c 69
I tried something like this but it does not work:
d %>%
group_by(V1) %>%
summarise(V2) %>%
ggplot(aes(x = V1, y = V2)) + geom_col()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
A simple base R option using barplot + aggregate
barplot(SumValue ~ ., aggregate(cbind(SumValue = Value) ~ ., df, sum))
Seems to be pretty straightforward. Let me know if this helps.
library(dplyr)
library(ggplot2)
#Converting your values into a dataframe
data <- data.frame("Key" = c("a","a","b","b","c","c","c"), "Value" = c(5,7,23,12,21,21,27))
data <- data %>%
  group_by(Key) %>%
  summarise(Value = sum(Value))
#Plot
ggplot(data, aes(x=Key, y=Value))+
  geom_bar(stat="identity")
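As a cross-check on the grouped sum, dplyr's count() with a wt argument should give the same totals in one step. This is a sketch using the same example data; the output column name Total is chosen here for illustration:

```r
library(dplyr)

data <- data.frame(
  Key   = c("a", "a", "b", "b", "c", "c", "c"),
  Value = c(5, 7, 23, 12, 21, 21, 27)
)

# count() with wt sums Value within each Key instead of counting rows
totals <- data %>%
  count(Key, wt = Value, name = "Total")

totals
```

The totals data frame can be passed to ggplot() with geom_col() in the same way as the summarised data.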

Plot percentages in R as blocks

I have the table to the left
table <- cbind(c("x1","x2", "x3"), c("0.4173","0.9211","0.0109"))
and am trying to make the plot to the right.
Are there any packages in R that can do what I'm trying to achieve?
A base R option would be to use barplot on a named vector
barplot(v1)
Or convert to two column data.frame with stack and use the formula method
barplot(values ~ ind, stack(v1))
Or we can use tidyverse with ggplot
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
enframe(v1, name = "id", value = 'block') %>%
  mutate(non_block = 1 - block) %>%
  pivot_longer(cols = -id) %>%
  ggplot(aes(x = id, y = value, fill = name)) +
  geom_col() +
  coord_flip() +
  theme_bw()
data
v1 <- setNames(c(0.4173, 0.9211, 0.0109), paste0("x", 1:3))
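The table in the question stores the numbers as character strings, so it needs converting before any of the plotting calls will work. A possible way to get from that matrix to the named vector v1, sketched here with the matrix renamed to tab (a name chosen for illustration, to avoid masking base::table):

```r
# Character matrix as given in the question
tab <- cbind(c("x1", "x2", "x3"), c("0.4173", "0.9211", "0.0109"))

# Second column holds the values (as text), first column holds the labels
v1 <- setNames(as.numeric(tab[, 2]), tab[, 1])

v1
```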

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
And I'm looking to plot a heatmap where the ids are on the X-axis, the years on the Y-axis, and the color is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The reason for plotting both variables as factors is to show every level, even when a year doesn't have any id (in which case its whole row should be red).
EDIT:
Two things I forgot to add (hope it's not too late):
How to add alpha transparency to geom_tile() without messing it up?
I need to sort the ids from maximum missings to minimum missings.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the flag with FALSE for those:
df <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
  geom_tile()
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
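Putting the flag, the transparency, and the blue/red colours from the question together could look like the sketch below. The alpha value 0.6 and the object names df_full and p are arbitrary choices for illustration, and scale_fill_manual() is one possible way to set the two colours:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Data from the question
set.seed(0)
df <- data.frame(
  id   = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
  year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)
) %>%
  unique() %>%
  arrange(id, year)

# Mark observed rows, then add every missing id/year combination as FALSE
df_full <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

p <- ggplot(df_full, aes(id, year, fill = flag)) +
  geom_tile(alpha = 0.6) +  # semi-transparent tiles
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"))
```

Printing p draws the heatmap; because id and year are factors, complete() fills in every level, including years with no ids.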
EDIT2: To sort by missingness, add the following code before the ggplot code (fct_reorder comes from forcats, which library(tidyverse) loads):
# Determine the order of the IDs: fewest observed values (most missing) first
df_order <- df %>%
  group_by(id) %>%
  summarize(sum = sum(flag)) %>%
  arrange(sum) %>%
  mutate(order = row_number()) %>%
  select(id, order)
# Set the IDs in order on the chart
df <- df %>%
  left_join(df_order, by = "id") %>%
  mutate(id = fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
  mutate_all(~as.integer(as.character(.))) %>%
  mutate(data_exist = 1) %>%
  complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
  mutate(data_exist = factor(data_exist)) %>%
  ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df (with a marker column added) onto these combinations to see which ones were present in df:
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
  left_join(mutate(df, present='1'), by=c('id','year')) %>%
  mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
  geom_tile() +
  scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
  theme(legend.position="None") # hide legend

is the merging of data frames necessary here

I have a data frame and I would like to plot 3 lines, all from the "value" vector. The first two lines are the value vector grouped by "group", and the third line is the ungrouped value vector. I currently do this with two calls to dplyr that create two data frames, then merge them and plot the merged data frame. Is there an easier way that avoids two calls to dplyr?
d = data.frame(ym = rep(c(20011,20012,20023),3), group = c(0,0,1,0,1,0,1,0,1), value = c(1,2,3,4,2,1,3,3,2))
############### 1st call to dplyr to create plot with 2 lines grouped by "group"
d2 = d %>%
group_by(ym,group) %>%
summarise(
Value = mean(value)
)
d2= as.data.frame(d2)
d2
ggplot(data=d2 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()
###second call to dplyr to create a second data frame just for the UNGROUPED data
d3 = d %>%
group_by(ym) %>%
summarise(
Value = mean(value)
)
#### merge the TWO data frames
d3 =as.data.frame(d3)
d3$group=2
d4 = rbind(d2,d3)
### plot all 3 lines
ggplot(data=d4 , aes(x=ym, y=Value, group=as.factor(group), colour = as.factor(group))) +
geom_line() + geom_point()
You could do it in a single dplyr chain, but (AFAIK) it still requires two separate operations:
d2 = bind_rows(
  d %>%
    group_by(ym, group=as.character(group)) %>%
    summarise(Value = mean(value)),
  d %>%
    group_by(ym) %>%
    summarise(Value = mean(value),
              group = "All"))
The code group=as.character(group) is necessary to avoid an error when you add group="All", because bind_rows won't automatically coerce group from numeric to character. (This step is of course unnecessary in cases where the grouping column is already factor or character.)
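The type mismatch can be reproduced in isolation. In this sketch the two one-row data frames (num_group and chr_group) are made up for illustration, and tryCatch() is used only so the failing call does not stop the script:

```r
library(dplyr)

num_group <- data.frame(group = 0, Value = 1.5)     # numeric group column
chr_group <- data.frame(group = "All", Value = 2)   # character group column

# Combining numeric and character columns of the same name fails
res <- tryCatch(bind_rows(num_group, chr_group), error = function(e) e)
inherits(res, "error")

# Converting the numeric column first lets the rows combine
fixed <- bind_rows(
  mutate(num_group, group = as.character(group)),
  chr_group
)
fixed
```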
Then, for plotting you can highlight the average line so that it's separate from the individual groups. We map to shape solely to be able to remove the point markers for the All line:
ggplot(d2 , aes(x=ym, y=Value, colour=group)) +
  geom_line(aes(size=group)) +
  geom_point(aes(shape=group)) +
  scale_color_manual(values=c(hcl(c(15,195),100,65), "black")) +
  scale_shape_manual(values=c(16,16,NA)) +
  scale_size_manual(values=c(0.7,0.7,1.5))
