I have been trying to figure out how to group 9 datasets into 3 different groups (1, 2, and 3).
I have 3 different data frames that look like this:
ID1 ID2 dN dS Omega Label_ID1 Label_ID2 Group
QJY77946 NP_073551 0.0293 0.0757 0.3872 229E-CoV 229E-CoV Intra
QJY77954 NP_073551 0.0273 0.0745 0.3668 229E-CoV 229E-CoV Intra
...
So, the only columns that I'm interested in are three: dN, dS, and Omega.
My main goal is to take these three columns from my data frames and plot them in a boxplot using RStudio.
To do that, first I take the 3 columns of each data frame with these lines:
dN_1 <- df_1$dN
dS_1 <- df_1$dS
Omega_1 <- df_1$Omega
Then, to generate the plot I use this line (option 1):
boxplot(dN_S, dS_S, Omega_S, dN_M, dS_M, Omega_M, dN_E, dS_E, Omega_E,
main = "Test",
xlab = "Frames",
ylab = "Distribution",
col = "red")
My goal is to group these 9 boxes into 3 separate groups:
I know that using ggplot2 could be easier, so my second attempt uses these lines (option 2):
df_1 %>%
ggplot(aes(y=dN_S)) +
geom_boxplot(
color = "blue",
fill = "blue",
alpha = 0.2,
notch = T,
notchwidth = 0.8)
However, you can see that I couldn't find a way to plot all groups in the same plot.
So how can I group my data in the boxplot using option 1 or option 2? The second option is less developed, but perhaps someone could help with that too.
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
set.seed(123)
df_s <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_m <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_e <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df <-
list(df_s, df_m, df_e) %>%
set_names(c("S", "M", "E")) %>%
map_dfr(bind_rows, .id = "df") %>%
pivot_longer(-df)
ggplot(df)+
geom_boxplot(aes(x = name, y = value))+
facet_wrap(~df, nrow = 1)
Created on 2021-09-24 by the reprex package (v2.0.0)
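For option 1 (base graphics), one possible sketch is to pass explicit box positions through the `at` argument of boxplot() and leave a gap after every third box. The list name `frames`, the simulated values, and the colours are assumptions for illustration, not the data from the question:

```r
set.seed(123)

# Simulated stand-ins for the three data frames in the question (assumption)
frames <- list(
  S = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20)),
  M = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20)),
  E = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20))
)

# Nine columns, ordered frame by frame: S first, then M, then E
wide <- do.call(cbind, frames)

# Skip positions 4 and 8 so that a gap separates the three groups
pos  <- c(1:3, 5:7, 9:11)
cols <- c("tomato", "gold", "skyblue")

bp <- boxplot(wide, at = pos, col = rep(cols, times = 3), xaxt = "n",
              main = "Test", xlab = "Frames", ylab = "Distribution")

# One axis label per group, centred under its three boxes
axis(1, at = c(2, 6, 10), labels = names(frames), tick = FALSE)
legend("topright", legend = c("dN", "dS", "Omega"), fill = cols)
```

The colours repeat within each group, so the legend identifies the measure and the axis label identifies the data frame.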
One way to accomplish this is by providing ggplot() another aesthetic, like fill. Here's a small reproducible example:
library(tidyverse)
df <- tibble(category = rep(letters[1:4], 5),
time = c(rep("before", 10), rep("after", 10)),
num = rnorm(20))
df %>%
ggplot() +
geom_boxplot(aes(x=category, y=num, fill = time))
Let me know if you're looking for something else.
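Applied to data shaped like the question's, the fill approach might look like the sketch below. The data are simulated, and the names `frames`, `frame`, and `measure` are chosen here for illustration only:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(123)

# Simulated stand-ins for the three data frames (assumption)
frames <- list(
  S = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20)),
  M = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20)),
  E = data.frame(dN = runif(20), dS = runif(20), Omega = runif(20))
)

# Stack the frames, remember their origin, and reshape to long format
long <- bind_rows(frames, .id = "frame") %>%
  pivot_longer(-frame, names_to = "measure", values_to = "value") %>%
  mutate(frame = factor(frame, levels = names(frames)))

# One x position per data frame, one filled box per measure within it
p <- ggplot(long, aes(x = frame, y = value, fill = measure)) +
  geom_boxplot()
p
```

Compared with facet_wrap(), this keeps all nine boxes in one panel and uses the dodged fill groups to separate the measures.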
I have a dataset that looks like this, but with a few dozen more dependent variables.
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
type = sample(LETTERS, 3),
var1 = sample(1:100, 12),
var2 = sample(1:100, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
I would like to run a two-way ANOVA (effect of treatment and type on each of the dependent variables), and I am trying to do it automatically. Eventually, I'd like to draw a barplot for each of the dependent variables, with a compact letter display of the significance letters on each barplot. The letters would result from the ANOVA and the pairwise comparison test (example: https://statdoe.com/barplot-for-two-factors-in-r/ , section "Adding the compact letter display").
Could somebody give me hints to run this automatically? Or should I just give up and do it manually?
Does this work for you? If so, please also read my notes below.
# packages and function conflicts
library(broom)
library(conflicted)
library(emmeans)
library(multcomp)
library(multcompView)
library(tidyverse)
conflict_prefer("select", winner = "dplyr")
#> [conflicted] Will prefer dplyr::select over any other package
# data
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
type = sample(LETTERS, 3),
var1 = sample(1:100, 12),
var2 = sample(1:100, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
# Loop setup
var_names <- test %>% select(contains("var")) %>% names()
emm_list <- list()
anova_list <- list()
# Loop
for (var_i in var_names) {
test_i <- test %>%
rename(y_i = !!var_i) %>%
select(treatment, type, y_i)
mod_i <- lm(y_i ~ treatment*type, data = test_i)
anova_list[[var_i]] <- mod_i %>% anova %>% tidy()
emm_i <- emmeans(mod_i, ~treatment:type) %>%
cld(Letters = letters)
emm_list[[var_i]] <- as_tibble(emm_i)
}
# Combine emmean results
emm_out <- bind_rows(emm_list, .id = "id")
# Plot results
ggplot() +
facet_wrap(~ id) +
theme_classic() +
# mean
geom_point(
data = emm_out,
aes(y = emmean, x = type, color = treatment),
shape = 16,
alpha = 0.5,
position = position_dodge(width = 0.2)
) +
# confidence interval error bars (colored by treatment)
geom_errorbar(
data = emm_out,
aes(ymin = lower.CL, ymax = upper.CL, x = type, color = treatment),
width = 0.05,
position = position_dodge(width = 0.2)
) +
# compact letter display (colored by treatment)
geom_text(
data = emm_out,
aes(
y = emmean,
x = type,
color = treatment,
label = str_trim(.group)
),
position = position_nudge(x = 0.1),
hjust = 0,
show.legend = FALSE
)
Created on 2022-09-01 with reprex v2.0.2
First of all, this question is not a duplicate of, but similar to, the question "loop Tukey post hoc letters extraction".
Then note that you are talking about two separate things: ANOVA results and pairwise comparison test results. The former are stored in anova_list and the latter in emm_list/emm_out. Only the latter can give you the compact letter display, and I did that here via cld(). Please see this chapter on compact letter displays for more details. Note that in that chapter I also provide code for bar charts instead of the plot above, but also explain why bar charts may not be the best choice.
The topic of compact letter displays is a bit more complex when you have two factors and especially interactions between them. I was assuming that you always want to look at the interaction term, but maybe check out this answer for details on what your options are.
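If only the ANOVA tables are needed, the per-variable models could also be fitted without renaming columns, by building each formula with reformulate(). This is a base R sketch of that idea and not the code used for the output above; the name `anova_tables` is introduced here for illustration:

```r
set.seed(108)
test <- data.frame(
  n = 1:12,
  treatment = factor(paste("trt", 1:2)),
  type = sample(LETTERS, 3),
  var1 = sample(1:100, 12),
  var2 = sample(1:100, 12),
  var3 = sample(1:100, 12),
  var4 = sample(1:100, 12))

# All response columns whose name starts with "var"
var_names <- grep("^var", names(test), value = TRUE)

# One two-way ANOVA (with interaction) per response variable
anova_tables <- lapply(setNames(var_names, var_names), function(v) {
  f <- reformulate("treatment * type", response = v)  # e.g. var1 ~ treatment * type
  anova(lm(f, data = test))
})

anova_tables$var1
```

Each list element is an ordinary anova table with rows for treatment, type, their interaction, and the residuals.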
I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half of the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
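The same reordering idea also works in base graphics. qqplot() sorts both vectors before plotting, so a colour vector given in the original row order no longer lines up with the points; sorting the data yourself and reordering the groups with order() avoids that. The sketch below uses simulated data and method-of-moments estimates as a stand-in for beta.mle(), both of which are assumptions made to keep it self-contained:

```r
set.seed(1)

# Simulated data from two groups (assumption, not the original data)
values <- c(rbeta(100, 2, 5), rbeta(100, 5, 2))
group  <- rep(c("Group1", "Group2"), each = 100)

# Method-of-moments estimates of the beta parameters (stand-in for beta.mle)
m <- mean(values)
v <- var(values)
k <- m * (1 - m) / v - 1
shape1 <- m * k
shape2 <- (1 - m) * k

# Sort the data once and carry the group labels along in the same order
ord  <- order(values)
theo <- qbeta(ppoints(length(values)), shape1, shape2)
pal  <- c("black", "red")

plot(theo, values[ord],
     col = pal[as.integer(factor(group[ord]))],
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = levels(factor(group)), col = pal, pch = 1)
```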
I'm a little embarrassed to ask this question, but I've spent the better part of my work day trying to find a solution, and yet here I am...
What I'm aiming for is a simple ridgeline plot of several normal distributions which are calculated from given means and SDs in my data, like in this example:
case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3
All the examples I've found work with series of measurements, like the ridgeline plot of the temperatures in Lincoln, NE (https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html), and I cannot get them to work with my data.
As for my experience with R: I am not a complete idiot when it comes to data analysis, and proper visualization is something I am eager to learn, but right now I mainly need a solution to this problem.
Thank you very much for your help!
Edit -- added precise theoretical answer.
Here's a way using dnorm to construct exact normal curves to those specifications:
library(tidyverse); library(ggridges)
n = 100
df3 <- df %>%
mutate(low = caseMean - 3 * caseSD, high = caseMean + 3 * caseSD) %>%
uncount(n, .id = "row") %>%
mutate(x = (1 - row/n) * low + row/n * high,
norm = dnorm(x, caseMean, caseSD))
ggplot(df3, aes(x, case_number, height = norm)) +
geom_ridgeline(scale = 3)
Similar to Sada93's answer, using dplyr and tidyr:
library(tidyverse); library(ggridges)
n = 50000
df2 <- df %>%
uncount(n) %>%
mutate(value = rnorm(n(), caseMean, caseSD))
ggplot(df2, aes(x = value, y = case_number)) + geom_density_ridges()
sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text = "case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3")
You need to create a new data frame with the actual distribution values and then use ggridges as follows,
library(ggplot2)
library(ggridges)
data = data.frame(case = c("case1","case2","case3"),caseMean = c(0,1,3),caseSD = c(1,2,3))
#Create 100 rows for each mean and SD
data_plot = data.frame(case = character(),value = numeric())
n = 100
for(i in 1:nrow(data)){
case = data$case[i]
mean = data$caseMean[i]
sd = data$caseSD[i]
val = rnorm(n,mean,sd)
data_plot = rbind(data_plot,
data.frame(case = rep(case,n),
value = val))
}
ggplot(data = data_plot,aes(x = value,y = case))+geom_density_ridges()
I have a question related to histograms in R using ggplot2. I have been trying to represent the values of two different variables in one histogram. After some trial and error and looking for solutions on Stack Overflow I got it working, but does somebody know how to show the NA count as an additional column, so that I can compare the missing values in the two variables?
Here is the R code:
i<-"ADL_1_bathing"
j<-"ADL_1_T2_bathing"
t1<-data.frame(datosMedicos[,i])
colnames(t1)<-"datos"
t2<-data.frame(datosMedicos[,j])
colnames(t2)<-"datos"
t1$time<-"t1"
t2$time<-"t2"
juntarParaGrafico<-rbind(t1,t2)
ggplot(juntarParaGrafico, aes(datos, fill = time) ) +
geom_histogram(col="darkblue",alpha = 0.5, aes(y = ..count..), binwidth = 0.2, position = 'dodge', na.rm = F) +
theme(legend.justification = c(1, 1), legend.position=c(1, 1))+
labs(title=paste0("Distribution of ",i), x=i, y="Count")
And this is the output:
Image about the two variables values but without the missing bars:
You could try to summarise the number of NAs before plotting. How about this?
library(ggplot2)
library(dplyr)
df1 = data.frame(a = rnorm(1:20))
df1[sample(1:20, 5),] = NA
df2 = data.frame(a = rnorm(1:20))
df2[sample(1:20, 3),] = NA
df2$time = "t2"
df1$time = "t1"
df = rbind(df1, df2)
# count the missing values per time point
histogramDF = df %>% group_by(time) %>% summarise(numNAs = sum(is.na(a)))
histogramDF
# draw one bar per time point, with the NA count as its height
ggplot(histogramDF, aes(x = time, y = numNAs, fill = time)) + geom_col()
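To show the missing values as an extra bar next to the observed values, as asked in the question, one option is to turn NA into an explicit category before plotting. This sketch assumes that the scores take only a few discrete values; the data and the names `d`, `datos_f`, and the label "Missing" are made up for illustration:

```r
library(ggplot2)

set.seed(42)

# Made-up discrete scores with some missing values (assumption)
d <- data.frame(
  datos = sample(c(0, 1, 2, NA), 40, replace = TRUE),
  time  = rep(c("t1", "t2"), each = 20)
)

# Recode NA as its own category so that it gets a bar of its own
d$datos_f <- ifelse(is.na(d$datos), "Missing", as.character(d$datos))

p <- ggplot(d, aes(x = datos_f, fill = time)) +
  geom_bar(position = "dodge", alpha = 0.5, colour = "darkblue") +
  labs(x = "datos", y = "Count")
p
```

For continuous values this recoding would not be appropriate; in that case the separate NA count plot above is the safer route.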
I've seen heatmaps with values made in various R graphics systems including lattice and base like this:
I tend to use ggplot2 a bit and would like to be able to make a heatmap with the corresponding cell values plotted. Here's the heat map and an attempt using geom_text:
library(reshape2)
library(ggplot2)
dat <- matrix(rnorm(100, 3, 1), ncol=10)
names(dat) <- paste("X", 1:10)
dat2 <- melt(dat, id.var = "X1")
p1 <- ggplot(dat2, aes(as.factor(Var1), Var2, group=Var2)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "red")
p1
#attempt
labs <- c(apply(round(dat[, -2], 1), 2, as.character))
p1 + geom_text(aes(label=labs), size=1)
Normally I can figure out the x and y values to pass but I don't know in this case since this info isn't stored in the data set. How can I place the text on the heatmap?
Key is to add a row identifier to the data and shape it "longer".
edit Dec 2022 to make code reproducible with R 4.2.2 / ggplot2 3.4.0 and reflect changes in tidyverse semantics
library(ggplot2)
library(tidyverse)
dat <- matrix(rnorm(100, 3, 1), ncol = 10)
## the matrix needs names
names(dat) <- paste("X", 1:10)
## convert to tibble, add row identifier, and shape "long"
dat2 <-
dat %>%
as_tibble() %>%
rownames_to_column("Var1") %>%
pivot_longer(-Var1, names_to = "Var2", values_to = "value") %>%
mutate(
Var1 = factor(Var1, levels = 1:10),
Var2 = factor(gsub("V", "", Var2), levels = 1:10)
)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
ggplot(dat2, aes(Var1, Var2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1))) +
scale_fill_gradient(low = "white", high = "red")
Created on 2022-12-31 with reprex v2.0.2
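The same long format can also be produced in base R, which gives geom_text() explicit x and y columns without the tibble conversion. This is an alternative sketch rather than the code above; the dimnames and the object name `long` are assumptions made for illustration:

```r
library(ggplot2)

set.seed(1)

# Matrix with explicit row and column names (names chosen for illustration)
dat <- matrix(rnorm(100, 3, 1), ncol = 10,
              dimnames = list(as.character(1:10), paste0("X", 1:10)))

# One row per cell: Var1 = row name, Var2 = column name, value = cell value
long <- as.data.frame(as.table(dat), responseName = "value")

p <- ggplot(long, aes(Var1, Var2)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1))) +
  scale_fill_gradient(low = "white", high = "red")
p
```

Because the row and column identifiers are ordinary columns of `long`, the text labels inherit the same positions as the tiles.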
There is another simpler way to make heatmaps with values. You can use pheatmap to do this.
dat <- matrix(rnorm(100, 3, 1), ncol=10)
names(dat) <- paste("X", 1:10)
install.packages('pheatmap') # if not installed already
library(pheatmap)
pheatmap(dat, display_numbers = T)
This will give you a plot like this
If you want to remove clustering and use your color scheme you can do
pheatmap(dat, display_numbers = T, color = colorRampPalette(c('white','red'))(100), cluster_rows = F, cluster_cols = F, fontsize_number = 15)
You can also change the fontsize, format, and color of the displayed numbers.