Adding cumulative quantities to geom_bar plots drawn with facet_wrap - r

Newbie here! After a long search I still could not find a satisfying solution to my problem. I have a dataset of heart failure clinical records (https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records) and I would like to display a series of geom_bar plots where the "Survived" and "Dead" patients are counted per category (i.e. sex, smoking and so on).
I think I have done a decent job at preparing the plots, and they look right to me. The problem is that it is difficult to see the ratio between surviving and dying patients for each characteristic.
I have two ideas, but both of them elude me:
Put a count on top of every bar so that the ratio becomes obvious
Directly show the ratio on every characteristic.
Here is the code I wrote.
library(ggplot2)
library(dplyr) # %>% and select()
library(tidyr) # gather()
heart_faliure_data <- read.csv(file = "heart_failure_clinical_records_dataset.csv", header = FALSE, skip=1)
#Prepare Column Names
c_names <- c("Age",
"Anaemia",
"creatinine_phosphokinase",
"diabetes",
"ejection_fraction",
"high_blood_pressure",
"platelets",
"serum_creatinine",
"serum_sodium",
"sex",
"smoking",
"time",
"DEATH_EVENT")
#Apply column names to the dataframe
colnames(heart_faliure_data) <- c_names
# Some classes like sex, Anaemia, diabetes, high_blood_pressure, smoking and DEATH_EVENT are booleans
# (see description of Dataset) and should be transformed into factors
heart_faliure_data$sex <- factor(heart_faliure_data$sex,
levels=c(0,1),
labels=c("Female","Male"))
heart_faliure_data$smoking <- factor(heart_faliure_data$smoking,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$DEATH_EVENT <- factor(heart_faliure_data$DEATH_EVENT,
levels=c(0,1),
labels=c("Survived","Died"))
heart_faliure_data$high_blood_pressure <- factor(heart_faliure_data$high_blood_pressure,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$Anaemia <- factor(heart_faliure_data$Anaemia,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$diabetes <- factor(heart_faliure_data$diabetes,
levels=c(0,1),
labels=c("No","Yes"))
# Convert Age to an integer value
heart_faliure_data$Age <- as.integer(heart_faliure_data$Age)
# selecting the categorical variables and study the effect of each variable on death-event
categorical.heart_failure <- heart_faliure_data %>%
select(Anaemia,
diabetes,
high_blood_pressure,
sex,
smoking,
DEATH_EVENT) %>%
gather(key = "key", value = "value", -DEATH_EVENT)
#Visualizing this effect with a grouped barplot
categorical.heart_failure %>%
ggplot(aes(value)) +
geom_bar(aes(x = value,
fill = DEATH_EVENT),
alpha = .2,
position = "dodge",
color = "black",
width = .7,
stat = "count") +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead"))
And here is a (not so bad) image of the result:
The goal would be to have some numerical value on top of the bars, or even just a y-axis indication...
I would be glad about any help you can give me!

What about something like this? To make it work, I aggregated the data first:
tmp <- categorical.heart_failure %>%
group_by(DEATH_EVENT, key, value) %>%
summarise(n = n())
#Visualizing this effect with a grouped barplot
tmp %>%
ggplot(aes(x = value, y=n)) +
geom_bar(aes(fill = DEATH_EVENT),
alpha = .2,
position = position_dodge(width=1),
color = "black",
width = .7,
stat = "identity") +
geom_text(aes(x=value, y=n*1.1, label = n, group=DEATH_EVENT), position = position_dodge(width=1), vjust=0) +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead")) +
coord_cartesian(ylim=c(0, max(tmp$n)*1.25))
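The second idea from the question (showing the ratio directly) could be handled in a similar way. The following is only a sketch on a made-up stand-in for categorical.heart_failure: the toy rows, the share column and the label format are assumptions, not part of the original answer. It computes each outcome's share within one category level and prints it next to the count.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for categorical.heart_failure (made-up rows)
toy <- data.frame(
  DEATH_EVENT = c("Survived", "Survived", "Died", "Survived",
                  "Died", "Died", "Survived", "Survived"),
  key         = "smoking",
  value       = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No")
)

# Count per bar, then the share of each outcome within one category level
toy_counts <- toy %>%
  count(key, value, DEATH_EVENT, name = "n") %>%
  group_by(key, value) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Label every bar with "count (percent)"
p <- ggplot(toy_counts, aes(x = value, y = n, fill = DEATH_EVENT)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%d (%.0f%%)", n, 100 * share)),
            position = position_dodge(width = 0.9), vjust = -0.3) +
  facet_wrap(~ key, scales = "free")
```

Grouping by key and value before dividing means the shares add up to 100% within each pair of bars, which is what makes the survived/died ratio readable at a glance.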

Related

Removing NA category from grouped bar charts

I am currently working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector)
I have created a function that loops through all 250 columns and creates grouped bar charts where the x axis shows sectors, the y axis shows the percentage distribution of answers, and the fill is the underlying column from data. Below you can see the code for the function:
by_sector <- lapply(names(data), function(variable) {
ggplot(
data = data,
mapping = aes(x=sector,fill = data[[variable]])
) +
geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge") +
labs(x = variable, y = "% of total", fill = "Response", caption = paste("Total =", sum(!is.na(data[[variable]])))) +
geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..],accuracy = 0.1) ),
stat="count", position=position_dodge(1), vjust=0.5)+
#scale_fill_brewer(palette = "Accent")+
scale_fill_discrete(na.translate = FALSE) +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
})
As you can see from the image below, since I use data columns as fill, a transparent NA category shows up. I want to remove that category from the grouped bars.
I tried a couple of things:
scale_fill_discrete(na.translate = FALSE): this just removed NA from the legend, not from the grouped bars.
fill = subset(data,!is.na(data[[variable]])): this didn't work.
ggplot(data=na.omit(data[[variable]])): this didn't work either.
Is there a way to modify my code for the bar plots so that the NA category doesn't show up as a bar in the graph? Thank you very much in advance!
One option would be to aggregate your data outside of ggplot(), which makes it easier to debug, removes the duplicated computations inside the code and makes it easy to drop the NA categories if desired.
Additionally, I moved the plotting code to a separate function, which also allows for easier debugging, e.g. by running the code for just one example.
Finally, note that I switched to the .data pronoun as the recommended way to use column names passed as strings.
Showing only the plots for two of the problematic columns:
EDIT Fixed a small bug by removing the NA values before aggregating instead of doing that afterwards.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(x = sector, y = pct, fill = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
x = variable, y = "% of total", fill = "Response",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_fill_brewer(palette = "Accent") +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
}
by_sector <- lapply(names(data), plot_fun)
by_sector[c(3, 6)]
#> [[1]]
#>
#> [[2]]
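One plausible reading of why the order of filtering and aggregating matters: if the NA rows are dropped only after the percentages are computed, they still count in the denominator. The sketch below uses made-up data (one sector, one question) to show the difference; it is an illustration, not the original answer's code.

```r
library(dplyr)

# Made-up data: one sector, one question with a missing answer
toy <- data.frame(
  sector = c("Energy", "Energy", "Energy", "Energy"),
  q1     = c("yes", "yes", "no", NA)
)

# Dropping NA after computing percentages: NA still counts in the denominator
after <- toy %>%
  count(sector, q1) %>%
  group_by(sector) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  filter(!is.na(q1))

# Dropping NA first: percentages are shares of the answered rows only
before <- toy %>%
  filter(!is.na(q1)) %>%
  count(sector, q1) %>%
  group_by(sector) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

sum(after$pct)   # 0.75, the NA share is silently missing
sum(before$pct)  # 1
```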

How to re-order bar plot with ggplot2 by distribution proportions of variables

I'm looking for a way to re-order the bar plot produced with ggplot2 such that the rates of the less observed category (i.e. ThemeFirst) increase from the left to the right. The original bar plot I generated is below:
It is plotted using the following code:
t1<-table(data$Variety,data$realization)
dataframe_realization<-data.frame(Variety=names(prop.table(t1,1)[,1]),
RecipientFirst=prop.table(t1,1)[,1],
ThemeFirst=prop.table(t1,1)[,2],
row.names=NULL)
dataframe_realization<-melt(dataframe_realization,id="Variety",variable_name="Variant")
# adding absolute frequency values to the table
dataframe_realization_absfreq<-data.frame(Variety=names(t1[,1]),
RecipientFirst=as.numeric(t1[,1]),
ThemeFirst=as.numeric(t1[,2]))
dataframe_realization_absfreq<-melt(dataframe_realization_absfreq,id="Variety",variable_name = "Variant")
dataframe_realization$absvals<-dataframe_realization_absfreq$value
dataframe_realization$Proportion<-dataframe_realization$value
dataframe_realization$variable<-dataframe_realization$Variant
labels.order <- dataframe_realization %>%
filter(Variety == '14th-18thCentury') %>%
arrange(Proportion) %>%
pull(Variant)
df.new <- dataframe_realization %>%
mutate(
Variable = factor(Variant, levels = labels.order, ordered = T)
)
# stacked bar plot with absolute values added on the each bar
realization_plot_absvals<-ggplot(data = dataframe_realization, aes(Variety, Proportion, group = Variant)) +
geom_col(aes(fill = Variant)) +
labs(title = "", y="Proportion of theme-recipient tokens", x="") +
scale_y_continuous() +
scale_fill_grey(start = 0.25, end = 0.75) +
geom_text(aes(label = absvals), position = position_stack(vjust = 0.5),color=ifelse(dataframe_realization$Variant=="RecipientFirst", "white", "black")) +
theme(text=element_text(size=15))
ggsave("~/VADIS_VarietyProportion_absvals.png",realization_plot_absvals, width=13, height=6, units="in", dpi = 1000)
So, again, the idea is to rearrange the plot so that the bar on the far left is the "Variety" with the lowest ThemeFirst proportion (namely 19thCentury), and the bar on the far right is the one with the highest ThemeFirst proportion (namely CTM_Other). The data for producing the plot can be found on this OSF page.
Just make a factor of the x-axis variable, with the levels based on the order of proportions like you did for labels.order.
library(dplyr)
library(reshape2)
library(ggplot2)
variety.order <- dataframe_realization %>%
filter(Variant == labels.order[1]) %>%
arrange(Proportion) %>%
pull(Variety)
df.new <- dataframe_realization %>%
mutate(
Variable = factor(Variant, levels = labels.order, ordered = T),
Variety = factor(Variety, levels = variety.order)
)
# stacked bar plot with absolute values added on the each bar
realization_plot_absvals<-ggplot(data = df.new, aes(Variety, Proportion, group = Variant)) +
geom_col(aes(fill = Variant)) +
labs(title = "", y="Proportion of theme-recipient tokens", x="") +
scale_y_continuous() +
scale_fill_grey(start = 0.25, end = 0.75) +
geom_text(aes(label = absvals), position = position_stack(vjust = 0.5), color = ifelse(df.new$Variant == "RecipientFirst", "white", "black")) +
theme(text=element_text(size=15))
(Note: with reshape2::melt the argument is variable.name; variable_name is the spelling used by the older reshape::melt.)
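The reordering idea can be checked in isolation with a few made-up rows. This is a minimal base R sketch; the variety names A, B, C and the proportions are placeholders, not the OSF data.

```r
# Made-up proportions in long format; the variety names are placeholders
toy <- data.frame(
  Variety    = rep(c("A", "B", "C"), each = 2),
  Variant    = rep(c("RecipientFirst", "ThemeFirst"), times = 3),
  Proportion = c(0.7, 0.3, 0.9, 0.1, 0.6, 0.4)
)

# Order the varieties by their ThemeFirst proportion, smallest first
theme_first   <- toy[toy$Variant == "ThemeFirst", ]
variety_order <- theme_first$Variety[order(theme_first$Proportion)]

# Fixing the factor levels is what controls the left-to-right bar order
toy$Variety <- factor(toy$Variety, levels = variety_order)
levels(toy$Variety)
# "B" "A" "C"
```

ggplot2 places discrete x values in the order of the factor levels, so no change to the plotting code itself is needed once the levels are set.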

Is it possible to adjust a second graph to a second y-axis in ggplot?

I am trying to make several bar plots with their standard errors added to the plot. Adding a second y-axis was not that hard; however, I now also want my standard errors to be drawn against this new y-axis. I know that I can manipulate the y-axis, but this is not really what I want. To illustrate, this is the plot I have now, where I just divided the first y-axis by 100,
but what I want is something more like this, for all bar plots (this one was done for the first bar plot using Excel).
Here is my code:
df_bar <- as.data.frame(
rbind(
c('g1', 0.945131015, 1.083188828, 1.040164338,
1.115716593, 0.947886795),
c('g2', 1.393211286, 1.264193745, 1.463434395,
1.298126006, 1.112718796),
c('g3', 1.509976099, 1.450923745, 1.455102201,
1.280102338, 1.462689245),
c('g4', 1.591697668, 1.326292649, 1.767207296,
1.623619341, 2.528108183),
c('g5', 2.625114848, 2.164050167, 2.092843287,
2.301950359, 2.352736806)
)
)
colnames(df_bar)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_bar <- melt(df_bar, id.vars = "interval",
variable.name = "name",
value.name = "value")
df_line <- as.data.frame(
rbind(
c('g1', 0.0212972, 0.0164494, 0.0188898, 0.01888982,
0.03035883),
c('g2', 0.0195600, 0.0163811, 0.0188747, 0.01887467,
0.03548092),
c('g3', 0.0192249, 0.0161914, 0.02215852, 0.02267605,
0.03426538),
c('g4', 0.0187961, 0.0180842, 0.01962371, 0.02103450,
0.03902890),
c('g5', 0.0209987, 0.0164596, 0.01838280, 0.02282300,
0.03516818)
)
)
colnames(df_line)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_line <- melt(df_line, id.vars = "interval",
variable.name = "name",
value.name = "sd")
df <- inner_join(df_bar,df_line, by=c("interval", "name"))
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
geom_line(aes(x = interval, y = sd, group = 1),
color = "black", size = .75) +
scale_y_continuous("Value", sec.axis = sec_axis(~ . /100, name = "sd")) +
facet_grid(~name, scales = "free") +
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
Thanks in advance..
As described in this example, you also have to transform your sd values to match the scale of your second axis. In your example you divided by 100, so you have to multiply your sd by 100, as shown below:
library(tidyverse)
library(data.table)
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
scale_y_continuous("Value", sec.axis = sec_axis(~ ./100, name = "sd"))+
geom_line(aes(x = interval, y = sd*100, group = 1),
color = "black", size = .75)+
facet_grid(~name, scales = "free")+
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
You can also use a different value to scale your second axis. In this example I used 50 as a scaling factor, which in my opinion looks a bit better:
Created on 2022-08-25 with reprex v2.0.2
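To make the pairing between the two transformations explicit, the factor can be kept in a single variable. This is a sketch on made-up numbers; scale_factor and the toy data frame are names introduced here for illustration, and 50 is simply the value mentioned above.

```r
library(ggplot2)

# Made-up values standing in for the bar heights and their sd
toy <- data.frame(
  interval = c("g1", "g2", "g3"),
  value    = c(1.0, 1.4, 1.5),
  sd       = c(0.021, 0.020, 0.019)
)

scale_factor <- 50  # assumed value; pick one that brings sd into the range of value

p <- ggplot(toy, aes(x = interval)) +
  geom_col(aes(y = value, fill = interval)) +
  # multiply sd up so that it is drawn on the primary axis ...
  geom_line(aes(y = sd * scale_factor, group = 1)) +
  # ... and divide the axis labels by the same factor to read off the original sd
  scale_y_continuous("Value", sec.axis = sec_axis(~ . / scale_factor, name = "sd"))
```

Because the line is multiplied by the same number the secondary axis is divided by, the right-hand axis always shows the untransformed sd, whatever factor is chosen.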

R ggplot: I am having problem drawing R histogram

I am working with the built-in esoph dataset. My task is to draw a histogram of the "ncontrols" variable for each age group in the dataset.
Here is the code I wrote. First, I group by agegp (age groups), calculate the total ncontrols (number of control cases) for each age group, and rename both agegp and ncontrols to something easily readable.
library(tidyverse)
library(datasets)
library(ggplot2)
data_esoph <- esoph %>% group_by(agegp) %>%
summarise(Total_number_of_control_case = sum(ncontrols)) %>%
rename(Age_group = agegp)
Then I try to draw a histogram using geom_histogram
plot_histogram <- ggplot(data_esoph, aes(x = Age_group)) +
geom_histogram(color = 'black', fill = 'grey70') +
labs(title ="Number of control cases by age group",x = "Age group", y = "Cases")+
theme(axis.title= element_text(size = 12), plot.title = element_text(size = 16))
I run into an error that says
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
I know this error occurs because agegp (Age_group) is a discrete variable. I tried converting it to numeric, but to no avail. Does anyone have an idea how I can fix this problem and draw a histogram?
Since your data is already summarised to one total per age group, you can set stat="identity" in geom_bar like this:
library(tidyverse)
df %>%
ggplot(aes(x = Age_group, y = Total_number_of_control_case)) +
geom_bar(stat = "identity") +
labs(title ="Number of control cases by age group",x = "Age group", y = "Cases") +
theme(axis.title= element_text(size = 12), plot.title = element_text(size = 16))
Output:
Data
df <- data.frame(Age_group = c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
Total_number_of_control_case = c(115,190,167,166,106,31))
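As an alternative sketch, the totals can be computed straight from esoph and drawn with geom_col(), which takes bar heights from y in the same way as geom_bar(stat = "identity"). Using base aggregate() instead of the asker's dplyr pipeline is a choice made here only to keep the example short.

```r
library(ggplot2)

# Total controls per age group from the built-in esoph data
data_esoph <- aggregate(ncontrols ~ agegp, data = esoph, FUN = sum)
names(data_esoph) <- c("Age_group", "Total_number_of_control_case")

# geom_col() draws bar heights taken directly from y
p <- ggplot(data_esoph, aes(x = Age_group, y = Total_number_of_control_case)) +
  geom_col(color = "black", fill = "grey70") +
  labs(title = "Number of control cases by age group",
       x = "Age group", y = "Cases")
```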

Summarizing data into percentages for side-by-side Bar Charts in R

Below is the code I am having trouble with and its output. The data set is linked at the bottom of the post.
What I want to do is group the StateCodes together with each MSN (the opposite of what the output shows now).
plotdata <- EnergyData %>%
filter(MSN %in% c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB")) %>%
filter(Year %in% c("2009")) %>%
select(StateCode, MSN, Data) %>%
group_by(StateCode) %>%
mutate(pct = Data/sum(Data),
lbl = scales::percent(pct))
plotdata
This outputs to:
I thought that the group_by function would do that for me. Am I missing a key chunk of code?
Once the above chunk runs correctly, I want to create side-by-side bar charts by StateCode using the percentages of each of the 5 MSNs.
Here's the code I have so far.
ggplot(EnergyData,
aes(x = factor(StateCode,
levels = c("AZ", "CA", "NM", "TX")),
y = pct,
fill = factor(drv,
levels = c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB"),
labels = c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB")))) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
label = pct) +
geom_text(aes(label = lbl),
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "MSN",
x = "State",
title = "Renewable Resources by State") +
theme_minimal()
As of now I believe this all has to do with how I create the percentages for the bar charts.
Any assistance would be great. Thank you!
Here's the data I used (Energy Data): http://www.mathmodels.org/Problems/2018/MCM-C/ProblemCData.xlsx
Here is a version using data.table for the initial filtering, and changes to the plot function that hopefully get you the result you are after:
library(readxl)
library(data.table)
library(ggplot2)
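# mode = "wb" keeps the binary .xlsx intact on Windows; the ~/ex directory must already exist
download.file("http://www.mathmodels.org/Problems/2018/MCM-C/ProblemCData.xlsx", "~/ex/ProblemCData.xlsx", mode = "wb")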
# by default, factor levels will be in alphabetical order, so we do not need to specify that
EnergyData <- data.table(read_xlsx("~/ex/ProblemCData.xlsx"), key="StateCode", stringsAsFactors = TRUE)
# filter by Year and MSN list
plotdata <- EnergyData[as.character(MSN) %chin% c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB") & Year == 2009]
# calculate percentages of Data by StateCode
plotdata[, pct := Data/sum(Data), by = "StateCode"]
# plot using percent format and specified number of breaks
ggplot(plotdata,
aes(x = StateCode,
y = pct,
fill = MSN)) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), n.breaks = 6) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "MSN",
x = "State",
title = "Renewable Resources by State") +
theme_minimal()
Created on 2020-03-20 by the reprex package (v0.3.0)
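The plot above leaves out the percentage labels from the question. A dplyr sketch of the same per-state calculation with the labels added back is shown below. The toy values are made up and only the column names (StateCode, MSN, Data) follow the question.

```r
library(dplyr)
library(ggplot2)

# Made-up values; only the column names follow the question
toy <- data.frame(
  StateCode = rep(c("AZ", "CA"), each = 3),
  MSN       = rep(c("BMTCB", "GETCB", "HYTCB"), times = 2),
  Data      = c(10, 30, 60, 50, 25, 25)
)

# Share of each MSN within its state, plus a printable label
plotdata <- toy %>%
  group_by(StateCode) %>%
  mutate(pct = Data / sum(Data),
         lbl = scales::percent(pct, accuracy = 1)) %>%
  ungroup()

# Stacked bars with each label centred in its segment
p <- ggplot(plotdata, aes(x = StateCode, y = pct, fill = MSN)) +
  geom_col(position = "stack") +
  geom_text(aes(label = lbl), position = position_stack(vjust = 0.5), size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))
```

Because pct already sums to 1 within each state, position = "stack" gives the same picture as position = "fill" and lets position_stack() place the labels in the middle of each segment.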

Resources