Multigroup frequency with ggplot - r

I'm trying to replicate this histogram in R.
Here is how to mock my dataset:
dft <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
Here is my code :
library(ggplot2)
library(dplyr)
library(tidyr)
library(MASS)
df <- data.frame(
menutype = factor(dft$menutype, labels = c("GUILT" , "SSB0", "SSB1", "FLEX0", "FLEX1", "STD", "FLEX01"),
levels = c(1,2,4,5,6,8,12)),
Belief = factor(dft$Belief, levels = c(1), labels= c("Believe Learn")), #Interested only in this condition
Choice = factor(dft$Choice, levels = c(1), labels= c("Learn")) #Same here
)
df1 <- rbind(na.omit(df %>%
count(Belief, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))),
na.omit(df %>%
count(Choice, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))))
test <- paste(df1$Belief[1:6],paste(df1$Choice[7:13]))
test[1:6] <- paste(df1$Belief[1:6])
test[7:13] <- paste(df1$Choice[7:13])
df1$combine <- paste(test)
ggplot(data = df1, aes(menutype, prop, fill = combine)) +
labs(title = "Classification based on rank ordering\n", x = "", y = "Fraction of subjects", fill = "\n") +
geom_bar(stat = "identity", position = "dodge")+
theme_bw() +
theme(legend.position="bottom", plot.title = element_text(hjust = 0.5)) #Centering of the main title+
#geom_text(aes(label="ok"), vjust=-0.3, size=3.5)+
The problem is that this more or less works and I almost get the graph I want, but it is a workaround and there are still some errors. For example, I get the same value for both STD bars (0.10), whereas they should be 0 and 0.10 as in the original graph.
Ideally, I would like to have two separate data frames, one with menutype and Belief and the other with menutype and Choice. Then, as I did, I would compute the proportion of a specific level of each of those variables within each menutype, and finally plot the results as a bar chart, much like the graph in the original study. Additionally, I'd like to show the proportions as fractions above each bar, but that is optional.
Could someone help me on this matter? I'm really struggling to get it working.
Thanks in advance!
EDIT: I think the issue is with fill =. I would like to specify, for each bar, the variable I want (e.g., fill = df2$Belief & df2$Choice), but I don't know how to proceed.

library(tidyverse)
set.seed(10)
# example data frame
df <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
# calculate all metrics based on all variables you want to plot in a tidy way
df_plot = df %>%
group_by(Choice) %>%
count(menutype, Belief) %>%
mutate(prop = n / sum(n),
prop_text = paste0(n, "/", sum(n))) %>%
ungroup()
# barplots using one variable and split plots using another variable
df_plot %>%
mutate(Belief = factor(Belief),
menutype = factor(menutype)) %>%
ggplot(aes(menutype, prop, fill = Belief))+
geom_col(position = "dodge")+
facet_wrap(~Choice, ncol=1)+
geom_text(aes(label=prop_text), position = position_dodge(1), vjust = -0.5)+
ylim(0,0.2)
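If the goal is closer to the original figure, i.e. one bar per menutype for the share of Belief == 1 and one for the share of Choice == 1, a possible sketch is to summarise both shares per menutype first and then reshape to long format. This is not part of the answer above; the names prop_data, measure, n_yes and n_total are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(10)
df <- data.frame(
  menutype = sample(c(1, 2, 4, 5, 6, 8, 12), 120, replace = TRUE),
  Belief   = sample(c(0, 1), 120, replace = TRUE),
  Choice   = sample(c(0, 1), 120, replace = TRUE)
)

# One row per menutype and measure: number and share of subjects with value 1
prop_data <- df %>%
  group_by(menutype) %>%
  summarise(
    n_total = n(),
    Belief  = sum(Belief == 1),
    Choice  = sum(Choice == 1),
    .groups = "drop"
  ) %>%
  pivot_longer(c(Belief, Choice), names_to = "measure", values_to = "n_yes") %>%
  mutate(
    prop      = n_yes / n_total,
    prop_text = paste0(n_yes, "/", n_total),
    menutype  = factor(menutype)
  )

# Dodged bars: two bars per menutype, labelled with the fraction
p <- ggplot(prop_data, aes(menutype, prop, fill = measure)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = prop_text),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(x = "", y = "Fraction of subjects", fill = "") +
  theme_bw() +
  theme(legend.position = "bottom")
p
```

Because the counts are computed with sum() rather than count(), a menutype in which nobody has the value 1 keeps a row with a share of 0 instead of being dropped, so the two bars for a menutype can differ (for example 0 and 0.10).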

Related

Creating 4 yes/no bar charts on a single plot from frequency tables in R

I have four variables as columns in my data set:
whether the person had free school meals when they were younger
whether the person's parents attended university
whether the person studied A-level drama at school
whether their school offered A-level drama
Each value in the column is either "yes", "no" or "not applicable".
I want to put four sets of bar charts on one single plot (which I can then save as a .png), with each of the bar charts having a yes bar and a no bar.
I have used the below to create a frequency table for each of the variables. Here I've used the example of whether the person received free school meals (FSM) when they were younger:
FSM_df <- champions %>% count(FSM, sort = TRUE) %>% mutate(pct = prop.table(n))
percentage = label_percent()(FSM_df$pct)
FSM_df$percentage = percentage
I can use the code below to create a single bar chart, but I'm not sure how to do this for multiple plots:
ggplot(FSM_df, aes(x = FSM, y = n, fill = "#fe8080")) + geom_bar(stat = "identity", show.legend = FALSE) + coord_flip() + labs(x = "FSM", y = "Number of Champions") + geom_text(aes(label = percentage), color = "#662483")
Generating Random Data
lunch <- sample(0:1, 100, replace = TRUE, prob = c(0.7,0.3))
parents <- sample(0:1, 100, replace = TRUE, prob = c(0.5,0.5))
drama_major <- sample(0:1, 100, replace = TRUE, prob = c(0.9,0.1))
drama_offered <- sample(0:1, 100, replace = TRUE, prob = c(0.8,0.2))
Creating the Tibble
df <- tibble(lunch = lunch,
parents = parents,
drama_major = drama_major,
drama_offered = drama_offered)
pivot_longer
df %>%
pivot_longer(cols = 1:4,
names_to = "measure",
values_to = "measure_is_true_1") %>%
mutate(is_true = if_else(measure_is_true_1 == 0, "no", "yes")) %>%
ggplot(aes(x = measure)) +
geom_bar(aes(fill = is_true), position = "dodge", alpha = 0.7) +
coord_flip() +
theme_bw()
In this example, you convert your data to long format and then set the grouping aesthetic with fill. The ggplot logic is: put the measures along the x-axis and count how often the response is 0 or 1 for each one (whether or not they received free lunch, studied drama, etc.). This puts all four bar charts on the same plot.
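To also show a percentage label on each bar, as in the question, one option is to count first and plot the counts with geom_col(). The following is only a sketch on simulated data; the names counts, value, pct and label are assumptions for illustration, not from the original post.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
df <- tibble(
  lunch         = sample(0:1, 100, replace = TRUE, prob = c(0.7, 0.3)),
  parents       = sample(0:1, 100, replace = TRUE, prob = c(0.5, 0.5)),
  drama_major   = sample(0:1, 100, replace = TRUE, prob = c(0.9, 0.1)),
  drama_offered = sample(0:1, 100, replace = TRUE, prob = c(0.8, 0.2))
)

# Count yes/no per measure and compute the share within each measure
counts <- df %>%
  pivot_longer(everything(), names_to = "measure", values_to = "value") %>%
  mutate(is_true = if_else(value == 0, "no", "yes")) %>%
  count(measure, is_true) %>%
  group_by(measure) %>%
  mutate(pct = n / sum(n),
         label = paste0(round(100 * pct), "%")) %>%
  ungroup()

p <- ggplot(counts, aes(x = measure, y = n, fill = is_true)) +
  geom_col(position = position_dodge(width = 0.9), alpha = 0.7) +
  geom_text(aes(label = label),
            position = position_dodge(width = 0.9), hjust = -0.1) +
  coord_flip() +
  theme_bw()
p
```

With real data that also contains "not applicable", those rows would presumably be filtered out before counting so that the percentages refer to yes/no answers only.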

How to plot line graph of normalized differences from binned data with ggplot?

I have several sets of data for which I calculate binned normalized differences. I want to plot the results in a single line plot using ggplot. The lines representing different combinations of the paired differences should be distinguished by color and line type.
I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.
Below is the code I use for importing the data and calculating the normalized differences.
# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))
# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")
# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)
bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)
# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)
# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)
# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)
# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
How can I plot those results using ggplot() and geom_line()?
I want
dbin values on the x-axis ranging from 25-50
different colors and line types for the lines
Here is what I tried:
p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
geom_line() +
scale_linetype_manual(values=c("solid")) +
scale_x_continuous(limits = c(25, 50)) +
scale_color_manual(values = c("#000000"))
dput(dbin) outputs:
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
Edit
The final piece of code that works, using only the dbin and plots multiple dbins:
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinA = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
diff_data2 <-
full_join(
calc_bin_props(data = dat3),
calc_bin_props(dat4),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinB = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]
# Melt the data to plot more than 1 variable on a plot
m <- reshape2::melt(full_data, id.vars="bin") # melt() is assumed to come from reshape2
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
geom_line() +
scale_linetype_manual(values=c("solid", "longdash")) +
scale_color_manual(values = c("black", "black"))
dev.off()
library(tidyverse)
Creating example data as shown in the question, but with different probabilities in the two sample() calls to create a visible difference
between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.
Let’s define a function that can be applied to both datasets to get
the preprocessing done.
We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two correspond to the bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin)))
Before we feed the data into ggplot(), we convert it to long
format using pivot_longer(). This allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))
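The question also asked whether the step-by-step computation could be done all in one. A minimal base R sketch, under the assumption that each dataset is a plain numeric vector, could look like this. The helper bin_props() and the column bin_start are hypothetical names, and treating empty bins as 0 is a choice made here, not something stated in the original.

```r
# Hypothetical helper: share of the total sum that falls into each bin
bin_props <- function(x, breaks = seq(25, 50, by = 2)) {
  bins <- cut(x, breaks = breaks)   # values outside the breaks become NA
  sums <- tapply(x, bins, sum)      # sum per bin; NA bins are ignored
  sums[is.na(sums)] <- 0            # assumption: empty bins count as 0
  sums / sum(sums)
}

set.seed(1)
x1 <- sample(25:50, 10000, replace = TRUE, prob = 25:0 / 100)
x2 <- sample(25:50, 9500,  replace = TRUE, prob = 0:25 / 100)

# Difference between the normalized bin sums, in percentage points
dbin <- (bin_props(x1) - bin_props(x2)) * 100

plot_data <- data.frame(
  bin_start = seq(25, 47, by = 2),  # left edge of each of the 12 bins
  dbin      = as.numeric(dbin)
)
```

plot_data can then be passed to ggplot(plot_data, aes(bin_start, dbin)) + geom_line() in the same way as diff_data above.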

ggplot2::geom_text(): how to display all factor levels, but suppress specific values like '0':

Here is code to give context to my question:
set.seed(1); tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+geom_text(
aes(label = n, y = n + 0.05),
position = position_dodge(1),
vjust = 0)
I want ALL of the levels of the variable x to be displayed on the x-axis (LETTERS[1:7]). For each Level with n>0, I want the value to display atop the bar for that level. For each level with n==0, I want the value label to NOT be displayed. Currently, the plot displays the 0 for 'empty' factor levels c("C","F"), and I want to suppress the display of '0's for those levels, but still display "C", and "F" on the x-axis.
I hope someone might be able to help me.
Thanks.
A simple ifelse() will do it. You can substitute any text you like, for example ifelse(n > 0, n, "No Data").
library( tidyr)
library( ggplot2)
library( dplyr )
set.seed(1); tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+
geom_text(
aes(label = ifelse( n>0, n , ""), y = n + 0.05),
position = position_dodge(1),
vjust = 0)
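One detail worth knowing about this approach: because the two branches of ifelse() have different types (numeric counts and an empty string), the result is coerced to character. That is fine for a text label. A tiny base R check with a made-up count vector:

```r
# Made-up counts, including two empty levels
n <- c(2, 0, 1, 0)

# Numeric counts and "" are combined into one character vector
labels <- ifelse(n > 0, n, "")
print(labels)
print(class(labels))
```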
You can pass a function to the data argument of geom_text(). In this example, you can subset the piped data (referred to as .x):
set.seed(1);
tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+
geom_text(data=~subset(.x,n>0),
aes(label = n, y = n + 0.05),
position = position_dodge(1),
vjust = 0)

How do I get a single percentage/proportion plot using ggplot for separate groups?

df <- data.frame(k = sample(1:3, 100, replace = TRUE),
g = sample(1:2, 100, replace = TRUE, prob = c(0.3, 0.7)))
In this data frame I have two groups g whose members are each in one of three conditions k.
Now, I want to see the proportions of the conditions k in both groups.
ggplot(df, aes(x = k, fill = as.factor(g), y = (..count..)/sum(..count..))) +
geom_bar(position=position_dodge())
That looks nice at first, but there is a problem. Group 2 is larger than group 1, so the proportions are not right: it looks as if all conditions were more likely in group 2 than in group 1. I need to calculate y = (..count..)/sum(..count..) for each group separately. How do I do this?
Here's how you can do it:
library(tidyverse)
df %>%
group_by(g) %>%
count(k) %>%
mutate(share = n / sum(n)) %>%
ggplot(aes(x = k, fill = as.factor(g), y = share)) +
geom_col(position = position_dodge())
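As a quick sanity check of the per-group shares, the same numbers can be reproduced in base R with prop.table(). This is a sketch on freshly simulated data; the seed and the name shares are assumptions.

```r
set.seed(42)
df <- data.frame(k = sample(1:3, 100, replace = TRUE),
                 g = sample(1:2, 100, replace = TRUE, prob = c(0.3, 0.7)))

# Rows are groups g, columns are conditions k;
# margin = 1 normalizes within each row, i.e. within each group
shares <- prop.table(table(g = df$g, k = df$k), margin = 1)
print(shares)
print(rowSums(shares))  # each group's shares add up to 1
```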

Specify the order for groups when using unite from dplyr for plotting with ggplot

I wanted to do something like this
Add multiple comparisons using ggsignif or ggpubr for subgroups with no labels on x-axis
I got this far:
Packages and Example data
library(tidyverse)
library(ggpubr)
library(ggpol)
library(ggsignif)
example.df <- data.frame(species = sample(c("primate", "non-primate"), 50, replace = TRUE),
treated = sample(c("Yes", "No"), 50, replace = TRUE),
gender = sample(c("male", "female"), 50, replace = TRUE),
var1 = rnorm(50, 100, 5))
Levels
example.df$species <- factor(example.df$species,
levels = c("primate", "non-primate"), labels = c("p", "np"))
example.df$treated <- factor(example.df$treated,
levels = c("No", "Yes"), labels = c("N","Y"))
example.df$gender <- factor(example.df$gender,
levels = c("male", "female"), labels = c("M", "F"))
Since I have had no luck getting either ggsignif or ggpubr to place the significance brackets correctly when the groups they refer to are not explicitly named on the x-axis (they are subgroups of each x-axis variable and are indicated only in the fill legend), I tried this instead.
example.df %>%
unite(groups, species, treated, remove = F, sep= "\n") %>%
{ggplot(., aes(groups, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~ gender, scales = "free") +
ggsignif::geom_signif(comparisons = combn(sort(unique(.$groups)), 2, simplify = F),
step_increase = 0.1)}
I get this,
Faceted plot with significance values computed for every group
However, the order of the combined groups on the x -axis is not how I want it. I want to order it with p/N, np/N, p/Y, np/Y for each facet.
How do I do this? Any help is greatly appreciated.
Edit: Creating a new variable using mutate and making it an ordered factor with my preferred plotting order solves this.
example.df %>%
unite(groups, species, treated, remove = F, sep= "\n") %>%
mutate(groups2 = factor(groups, levels = c("p\nN", "np\nN", "p\nY", "np\nY"),
ordered = TRUE)) %>%
{ggplot(., aes(groups2, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~gender,scales = "free") +
ggsignif::geom_signif(comparisons = combn(sort(unique(.$groups2)), 2, simplify = F),
step_increase = 0.1)}
But I am still looking for a solution that does not use unite at all, keeps the original factors, and still plots significance values using ggsignif or ggpubr.
The default parameters for interaction (from the base package) appear to give the factor ordering you are looking for:
example.df %>%
mutate(groups = interaction(species, treated, sep = "\n")) %>%
{ggplot(., aes(groups, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~ gender, scales = "free") +
geom_signif(comparisons = combn(sort(as.character(unique(.$groups))), 2, simplify = F),
step_increase = 0.1)}
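The reason this ordering comes out as wanted is that interaction() by default varies the levels of the first factor fastest. A small base R check, using two-element stand-ins for the species and treated columns:

```r
# Stand-ins for the factor columns of example.df, with the same levels and labels
species <- factor(c("primate", "non-primate"),
                  levels = c("primate", "non-primate"), labels = c("p", "np"))
treated <- factor(c("No", "Yes"),
                  levels = c("No", "Yes"), labels = c("N", "Y"))

groups <- interaction(species, treated, sep = "\n")

# All four combinations are kept as levels, first factor varying fastest
print(levels(groups))
```

The levels come out as p/N, np/N, p/Y, np/Y, which is the order requested in the question.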
