How to group datasets in a boxplot? - r

I have been trying to figure out how to group 9 datasets into 3 different groups (1, 2, and 3).
I have 3 different data frames that look like this:
ID1 ID2 dN dS Omega Label_ID1 Label_ID2 Group
QJY77946 NP_073551 0.0293 0.0757 0.3872 229E-CoV 229E-CoV Intra
QJY77954 NP_073551 0.0273 0.0745 0.3668 229E-CoV 229E-CoV Intra
...
So, the only columns that I'm interested in are three: dN, dS, and Omega.
My main goal is to take these three columns from my data frames and plot them in a boxplot using RStudio.
To do that, first I take the 3 columns of each data frame with these lines:
dN_1 <- df_1$dN
dS_1 <- df_1$dS
Omega_1 <- df_1$Omega
Then, to generate the plot I use this line (option 1):
boxplot(dN_S, dS_S, Omega_S, dN_M, dS_M, Omega_M, dN_E, dS_E, Omega_E,
main = "Test",
xlab = "Frames",
ylab = "Distribution",
col = "red")
My goal is to group these 9 boxes into 3 separate groups:
I know that using ggplot2 could be easier, so my option 2 is to use these lines (option 2):
df_1 %>%
ggplot(aes(y=dN_S)) +
geom_boxplot(
color = "blue",
fill = "blue",
alpha = 0.2,
notch = T,
notchwidth = 0.8)
However, as you can see, I couldn't find a way to plot all groups in the same plot.
So how can I group my data in the boxplot using option 1 or option 2? The second option is less developed, but perhaps someone could help with that too.

library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
set.seed(123)
df_s <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_m <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_e <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df <-
list(df_s, df_m, df_e) %>%
set_names(c("S", "M", "E")) %>%
map_dfr(bind_rows, .id = "df") %>%
pivot_longer(-df)
ggplot(df)+
geom_boxplot(aes(x = name, y = value))+
facet_wrap(~df, nrow = 1)
Created on 2021-09-24 by the reprex package (v2.0.0)

One way to accomplish this is by providing ggplot() another aesthetic, like fill. Here's a small reproducible example:
library(tidyverse)
df <- tibble(category = rep(letters[1:4], 5),
time = c(rep("before", 10), rep("after", 10)),
num = rnorm(20))
df %>%
ggplot() +
geom_boxplot(aes(x=category, y=num, fill = time))
Let me know if you're looking for something else.
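For option 1 (base graphics), a similar grouping can be sketched with the formula interface of boxplot(). This is only a minimal sketch under assumptions: the three data frames are simulated stand-ins, the names frames, long and pos are made up for the example, and the gaps between the groups come from the at positions.

```r
set.seed(123)
# Hypothetical stand-ins for the three data frames (S, M, E)
make_df <- function() data.frame(dN = runif(20), dS = runif(20), Omega = runif(20))
frames <- list(S = make_df(), M = make_df(), E = make_df())

# Stack each frame into long format (columns: values, ind) and tag its origin
long <- do.call(rbind, lapply(names(frames), function(nm) {
  d <- stack(frames[[nm]])
  d$frame <- nm
  d
}))
long$frame <- factor(long$frame, levels = c("S", "M", "E"))

# values ~ ind + frame: the measure varies fastest, so there are 3 boxes per frame.
# Skipping positions 4 and 8 leaves a gap between the groups.
pos <- c(1:3, 5:7, 9:11)
box_cols <- c("red", "green", "blue")
bp <- boxplot(values ~ ind + frame, data = long, at = pos,
              col = box_cols, xaxt = "n",
              main = "Test", xlab = "Frames", ylab = "Distribution")
axis(1, at = c(2, 6, 10), labels = levels(long$frame))
legend("topright", legend = levels(long$ind), fill = box_cols)
```

The colour vector is recycled across the nine boxes, so each measure keeps the same colour in every group.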

Related

Run ANOVA, pairwise test and Compact Letter Display on plots on several dependent variables

I have a dataset that looks like this, but with a few dozen more dependent variables.
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
type = sample(LETTERS, 3),
var1 = sample(1:100, 12),
var2 = sample(1:100, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
I would like to run a two-way ANOVA (effect of treatment and type on each of the dependent variables), and I am trying to do it automatically. Eventually, I'd like to plot barplots of each of the dependent variables, including a compact letter display of the significance letters on each of the barplots. The letters would result from the ANOVA and pairwise comparison test (example: https://statdoe.com/barplot-for-two-factors-in-r/ , section: Adding the compact letter display).
Could somebody give me hints to run this automatically? Or should I just give up and do it manually?
Does this work for you? If so, please also read my notes below.
# packages and function conflicts
library(broom)
library(conflicted)
library(emmeans)
library(multcomp)
library(multcompView)
library(tidyverse)
conflict_prefer("select", winner = "dplyr")
#> [conflicted] Will prefer dplyr::select over any other package
# data
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
type = sample(LETTERS, 3),
var1 = sample(1:100, 12),
var2 = sample(1:100, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
# Loop setup
var_names <- test %>% select(contains("var")) %>% names()
emm_list <- list()
anova_list <- list()
# Loop
for (var_i in var_names) {
test_i <- test %>%
rename(y_i = !!var_i) %>%
select(treatment, type, y_i)
mod_i <- lm(y_i ~ treatment*type, data = test_i)
anova_list[[var_i]] <- mod_i %>% anova %>% tidy()
emm_i <- emmeans(mod_i, ~treatment:type) %>%
cld(Letters = letters)
emm_list[[var_i]] <- as_tibble(emm_i)
}
# Combine emmean results
emm_out <- bind_rows(emm_list, .id = "id")
# Plot results
ggplot() +
facet_wrap(~ id) +
theme_classic() +
# mean
geom_point(
data = emm_out,
aes(y = emmean, x = type, color = treatment),
shape = 16,
alpha = 0.5,
position = position_dodge(width = 0.2)
) +
# error bars for the confidence limits of each mean
geom_errorbar(
data = emm_out,
aes(ymin = lower.CL, ymax = upper.CL, x = type, color = treatment),
width = 0.05,
position = position_dodge(width = 0.2)
) +
# compact letter display labels
geom_text(
data = emm_out,
aes(
y = emmean,
x = type,
color = treatment,
label = str_trim(.group)
),
position = position_nudge(x = 0.1),
hjust = 0,
show.legend = FALSE
)
Created on 2022-09-01 with reprex v2.0.2
First of all, this question is not a duplicate of, but similar to, the question "loop Tukey post hoc letters extraction".
Then note that you are talking about two separate things: ANOVA results and pairwise comparison test results. The former are stored in anova_list and the latter in emm_list/emm_out. Only the latter can get you the compact letter display, and I did that here via cld(). Please see this chapter on compact letter displays for more details. Note that in that chapter I also provide code for bar charts instead of the plot above, but also explain why bar charts may not be the best choice.
The topic of compact letter displays is a bit more complex when you have two factors and especially interactions between them. I was assuming that you always want to look at the interaction term, but maybe check out this answer for details on what your options are.
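If only the ANOVA tables are needed, the per-variable loop can also be sketched in base R by building each formula from the variable name. This is a rough sketch, not the code from the answer above: anova_tab is a name made up for the example, and the pairwise comparisons and letters would still need emmeans and cld() as shown earlier.

```r
set.seed(108)
test <- data.frame(
  n = 1:12,
  treatment = factor(paste("trt", 1:2)),
  type = sample(LETTERS, 3),
  var1 = sample(1:100, 12),
  var2 = sample(1:100, 12),
  var3 = sample(1:100, 12),
  var4 = sample(1:100, 12))

# All response columns whose name starts with "var"
var_names <- grep("^var", names(test), value = TRUE)

# Fit one two-way model per response and stack the ANOVA tables
anova_tab <- do.call(rbind, lapply(var_names, function(v) {
  fit <- lm(as.formula(paste(v, "~ treatment * type")), data = test)
  a <- anova(fit)
  data.frame(id = v,
             term = rownames(a),
             df = a[["Df"]],
             p.value = a[["Pr(>F)"]])
}))
anova_tab
```

Each response contributes four rows (treatment, type, their interaction, and the residuals), so the result can be filtered or faceted by id.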

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
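The same idea also works in base graphics. qqplot() sorts both vectors internally, so a col vector given in the original data order no longer lines up with the plotted points. A minimal sketch, assuming simulated data and fixed shape parameters (2 and 5 are made up here, in place of the beta.mle() estimates):

```r
set.seed(1)
n <- 200
# Hypothetical data: beta-distributed values with a random two-level group
values <- rbeta(n, shape1 = 2, shape2 = 5)
group <- factor(sample(c("Group1", "Group2"), n, replace = TRUE))

# Sort the data once and reorder the groups with the same index
ord <- order(values)
theo <- qbeta(ppoints(n), shape1 = 2, shape2 = 5)

cols <- c(Group1 = "steelblue", Group2 = "firebrick")
plot(theo, values[ord],
     col = cols[as.character(group[ord])], pch = 16,
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
abline(0, 1, lty = 2)  # reference line y = x
legend("topleft", legend = names(cols), col = cols, pch = 16)
```

Because the colours are indexed with group[ord], each point keeps the colour of the observation it came from.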

Plotting normal distributions in a ridgeline plot with ggridges

I'm a little embarrassed to ask this question, but I've spent the better part of my work day trying to find a solution, and yet here I am...
What I'm aiming for is a simple ridgeline plot of several normal distributions which are calculated from given means and SDs in my data, like in this example:
case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3
All the examples I've found work with series of measurements, like the example with the temperatures in Lincoln, NE:
Example of ridgeline plot
https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html and I cannot get them to work.
As for my experience with R: I am not a complete beginner when it comes to data analysis, and proper visualization is something I am eager to learn, but right now I mostly need a solution to this problem.
Thank you very much for your help!
Edit -- added precise theoretical answer.
Here's a way using dnorm to construct exact normal curves to those specifications:
library(tidyverse); library(ggridges)
n = 100
df3 <- df %>%
mutate(low = caseMean - 3 * caseSD, high = caseMean + 3 * caseSD) %>%
uncount(n, .id = "row") %>%
mutate(x = (1 - row/n) * low + row/n * high,
norm = dnorm(x, caseMean, caseSD))
ggplot(df3, aes(x, case_number, height = norm)) +
geom_ridgeline(scale = 3)
Similar to Sada93's answer, using dplyr and tidyr:
library(tidyverse); library(ggridges)
n = 50000
df2 <- df %>%
uncount(n) %>%
mutate(value = rnorm(n(), caseMean, caseSD))
ggplot(df2, aes(x = value, y = case_number)) + geom_density_ridges()
sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text = "case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3")
You need to create a new data frame with the actual distribution values and then use ggridges as follows,
library(ggplot2)
library(ggridges)
data = data.frame(case = c("case1","case2","case3"),caseMean = c(0,1,3),caseSD = c(1,2,3))
#Create 100 rows for each mean and SD
data_plot = data.frame(case = character(),value = numeric())
n = 100
for(i in 1:nrow(data)){
case = data$case[i]
mean = data$caseMean[i]
sd = data$caseSD[i]
val = rnorm(n,mean,sd)
data_plot = rbind(data_plot,
data.frame(case = rep(case,n),
value = val))
}
ggplot(data = data_plot,aes(x = value,y = case))+geom_density_ridges()

Plot NA counts in a histogram

I have a question related to histograms in R using ggplot2. I have been trying to represent some values from two different variables in a histogram. After trying and looking for solutions on Stack Overflow I got it working, but does somebody know how to show the NA count as a new column, just to compare the missing values in the two variables?
Here is the R code:
i<-"ADL_1_bathing"
j<-"ADL_1_T2_bathing"
t1<-data.frame(datosMedicos[,i])
colnames(t1)<-"datos"
t2<-data.frame(datosMedicos[,j])
colnames(t2)<-"datos"
t1$time<-"t1"
t2$time<-"t2"
juntarParaGrafico<-rbind(t1,t2)
ggplot(juntarParaGrafico, aes(datos, fill = time) ) +
geom_histogram(col="darkblue",alpha = 0.5, aes(y = ..count..), binwidth = 0.2, position = 'dodge', na.rm = F) +
theme(legend.justification = c(1, 1), legend.position=c(1, 1))+
labs(title=paste0("Distribution of ",i), x=i, y="Count")
And this is the output:
Image of the two variables' values, but without bars for the missing values:
You could try to summarise the number of NAs before plotting. How about this?
library(ggplot2)
library(dplyr)
df1 = data.frame(a = rnorm(1:20))
df1[sample(1:20, 5),] = NA
df2 = data.frame(a = rnorm(1:20))
df2[sample(1:20, 3),] = NA
df2$time = "t2"
df1$time = "t1"
df = rbind(df1, df2)
# count the NAs per time point
histogramDF = df %>% group_by(time) %>% summarise(numNAs = sum(is.na(a)))
histogramDF
# one bar per time point; geom_col() plots the precomputed counts as-is
ggplot(histogramDF, aes(x = time, y = numNAs, fill = time)) + geom_col()
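To get the NA count as an extra bar next to the regular value counts, one option is to tabulate each variable with NA kept as its own category. A rough base-R sketch on simulated discrete scores follows; t1, t2 and the helper count_with_na() are made up for the example.

```r
set.seed(42)
# Hypothetical discrete scores with some missing values
t1 <- sample(c(0, 1, 2, NA), 30, replace = TRUE)
t2 <- sample(c(0, 1, 2, NA), 30, replace = TRUE)

# Shared set of observed values (sort() drops the NAs)
lv <- sort(unique(c(t1, t2)))

# Count every level plus the NAs, always in the same order
count_with_na <- function(x, lv) {
  tab <- table(factor(x, levels = lv), useNA = "always")
  setNames(as.vector(tab), c(lv, "NA"))
}

counts <- rbind(t1 = count_with_na(t1, lv),
                t2 = count_with_na(t2, lv))

# Side-by-side bars; the last pair shows the missing values
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Value", ylab = "Count")
```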

Create heatmap with values from matrix in ggplot2

I've seen heatmaps with values made in various R graphics systems including lattice and base like this:
I tend to use ggplot2 a bit and would like to be able to make a heatmap with the corresponding cell values plotted. Here's the heat map and an attempt using geom_text:
library(reshape2)
library(ggplot2)
dat <- matrix(rnorm(100, 3, 1), ncol=10)
names(dat) <- paste("X", 1:10)
dat2 <- melt(dat, id.var = "X1")
p1 <- ggplot(dat2, aes(as.factor(Var1), Var2, group=Var2)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "red")
p1
#attempt
labs <- c(apply(round(dat[, -2], 1), 2, as.character))
p1 + geom_text(aes(label=labs), size=1)
Normally I can figure out the x and y values to pass but I don't know in this case since this info isn't stored in the data set. How can I place the text on the heatmap?
Key is to add a row identifier to the data and shape it "longer".
edit Dec 2022 to make code reproducible with R 4.2.2 / ggplot2 3.4.0 and reflect changes in tidyverse semantics
library(ggplot2)
library(tidyverse)
dat <- matrix(rnorm(100, 3, 1), ncol = 10)
## the matrix needs names
names(dat) <- paste("X", 1:10)
## convert to tibble, add row identifier, and shape "long"
dat2 <-
dat %>%
as_tibble() %>%
rownames_to_column("Var1") %>%
pivot_longer(-Var1, names_to = "Var2", values_to = "value") %>%
mutate(
Var1 = factor(Var1, levels = 1:10),
Var2 = factor(gsub("V", "", Var2), levels = 1:10)
)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
ggplot(dat2, aes(Var1, Var2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1))) +
scale_fill_gradient(low = "white", high = "red")
Created on 2022-12-31 with reprex v2.0.2
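The long shape can also be built directly from the matrix indices with row() and col(), which avoids the name-repair warning. A minimal sketch; the data frame name long is made up for the example, and the plotting part is only run if ggplot2 is installed.

```r
set.seed(1)
dat <- matrix(rnorm(100, 3, 1), ncol = 10)

# One row per cell: row index, column index, cell value
long <- data.frame(
  Var1 = factor(c(row(dat))),
  Var2 = factor(c(col(dat))),
  value = c(dat))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(Var1, Var2)) +
    geom_tile(aes(fill = value)) +
    geom_text(aes(label = round(value, 1))) +
    scale_fill_gradient(low = "white", high = "red")
  print(p)
}
```

Since each row of long carries its own indices, geom_text() inherits the x and y positions from the same data as geom_tile().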
There is another simpler way to make heatmaps with values. You can use pheatmap to do this.
dat <- matrix(rnorm(100, 3, 1), ncol=10)
names(dat) <- paste("X", 1:10)
install.packages('pheatmap') # if not installed already
library(pheatmap)
pheatmap(dat, display_numbers = T)
This will give you a plot like this
If you want to remove clustering and use your color scheme you can do
pheatmap(dat, display_numbers = T, color = colorRampPalette(c('white','red'))(100), cluster_rows = F, cluster_cols = F, fontsize_number = 15)
You can also change the fontsize, format, and color of the displayed numbers.
