Adding weights to barplot function in ggplot2 - r

I am working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
weights <- factor(c(0.13,0.25,0.13,0.22,0.22,0.25,0.4,0.13))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector,weights)
With help from Stack Overflow, I have created the following function to loop through the columns and create bar charts where the x axis shows the percentage of responses, the y axis shows the underlying column, and the fill is the sector.
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(fill = sector, x = pct, y = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
y = variable, x = "Percentage of responses", fill = "Sector legend",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_x_continuous(labels=function(x) paste0(x*100))+
scale_fill_brewer(palette = "Accent")+
theme_bw() +
theme(panel.grid.major.y = element_blank())
}
Now I want to apply survey weights so that the bar charts show weighted response percentages. I tried adding weight = data$weights to the mapping, but it didn't work. I also tried applying the weights when calculating the percentages with summarise(n = sum(weights)), but that didn't work either.
Is there a way to modify my code so that the weights are applied? Thank you in advance.

It's still not clear how you want to apply the weights. I've assumed here that you want to multiply the percentage by the weight. Note that you need to fix your data: weights should not be a factor if you want to use it as a numerical value in calculations. I added weights to the group_by so that they carry through, and then used them in mutate to create a weighted percentage.
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", "weights", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n), wpct = pct*weights) |>
ungroup()
ggplot(
data = data,
mapping = aes(fill = sector, x = wpct, y = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
y = variable, x = "Percentage of responses", fill = "Sector legend",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(wpct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_x_continuous(labels=function(x) paste0(x*100))+
scale_fill_brewer(palette = "Accent")+
theme_bw() +
theme(panel.grid.major.y = element_blank())
}
If this doesn't do the trick, do clarify how you look to use the weights and what the final outcome values should be.
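If the goal is instead a weighted share of responses within each sector, one possible reading (an assumption, since the question does not define the target quantity) is to sum the weights per sector and answer, then divide by the sector's total weight. Below is a minimal sketch on a cut-down copy of the sample data. The names survey, w and wpct are illustrative, and the weights are first converted from factor to numeric:

```r
library(dplyr)

# Cut-down copy of the sample data; weights start as a factor, as in the question.
survey <- data.frame(
  q1 = factor(c("yes", NA, "no", "yes", NA, "yes", "no", "yes")),
  sector = factor(c("Energy", "Water", "Energy", "Other", "Other",
                    "Water", "Transportation", "Energy")),
  weights = factor(c(0.13, 0.25, 0.13, 0.22, 0.22, 0.25, 0.4, 0.13))
)

# as.numeric() on a factor returns the level codes, so convert via character.
survey$weights <- as.numeric(as.character(survey$weights))

# Assumed definition: weighted share = weight in (sector, answer) / weight in sector.
weighted_share <- survey |>
  filter(!is.na(q1)) |>
  group_by(sector, q1) |>
  summarise(w = sum(weights), .groups = "drop_last") |>
  mutate(wpct = w / sum(w)) |>
  ungroup()

print(weighted_share)
```

With this definition wpct sums to 1 within each sector, so it could stand in for pct in the plotting code if that is the quantity you are after.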

Related

Removing NA category from grouped bar charts

I am currently working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector)
I have created a function to loop through all 250 columns and create grouped bar charts where x axis shows sectors, y axis shows percentage distribution of answers and fill is the underlying column from data. Below you can see the code for the function:
by_sector <- lapply(names(data), function(variable) {
ggplot(
data = data,
mapping = aes(x=sector,fill = data[[variable]])
) +
geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge") +
labs(x = variable, y = "% of total", fill = "Response", caption = paste("Total =", sum(!is.na(data[[variable]])))) +
geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..],accuracy = 0.1) ),
stat="count", position=position_dodge(1), vjust=0.5)+
#scale_fill_brewer(palette = "Accent")+
scale_fill_discrete(na.translate = FALSE) +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
})
As you can see from the image below, since I use data columns as fill, a transparent NA category shows up. I want to remove that category from the grouped bars.
I tried a couple of things:
scale_fill_discrete(na.translate = FALSE): this only removed NA from the legend, not from the grouped bars.
fill = subset(data,!is.na(data[[variable]])): this didn't work.
ggplot(data=na.omit(data[[variable]])): this didn't work either.
Is there a way to modify my code for the barplots so that the NA category doesn't show up as a bar in the graph? Thank you very much in advance!
One option would be to aggregate your data outside of ggplot() which makes it easier to debug, removes the duplicated computations inside the code and makes it easy to drop the NA categories if desired.
Additionally, I moved the plotting code to a separate function which also allows for easier debugging by e.g. running the code for just one example.
Finally, note that I switched to the .data pronoun, which is the recommended way to use column names passed as strings.
Showing only the plots for two of the problematic columns:
EDIT Fixed a small bug by removing the NA values before aggregating instead of doing that afterwards.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(x = sector, y = pct, fill = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
x = variable, y = "% of total", fill = "Response",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_fill_brewer(palette = "Accent") +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
}
by_sector <- lapply(names(data), plot_fun)
by_sector[c(3, 6)]
#> [[1]]
#>
#> [[2]]
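To see the "filter first, then aggregate" step and the .data pronoun in isolation, here is a small sketch on made-up data. The helper name count_by_sector and the toy data frame are hypothetical, not part of the answer above:

```r
library(dplyr)

# Made-up data: one NA response that should not become its own category.
toy <- data.frame(
  sector = c("Energy", "Energy", "Water", "Water"),
  q1 = c("yes", NA, "no", "yes")
)

# Hypothetical helper: drop NA responses, then count per sector and answer.
count_by_sector <- function(df, variable) {
  df |>
    filter(!is.na(.data[[variable]])) |>      # .data looks up the column by its string name
    group_by(across(all_of(c("sector", variable)))) |>
    summarise(n = n(), .groups = "drop")
}

res <- count_by_sector(toy, "q1")
print(res)
```

Because the NA rows are removed before grouping, no NA group reaches ggplot() and the percentages are computed over answered rows only.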

Creating a percentage bar plot in ggplot2

So I am just messing around with the titanic data set.
options(digits = 3) # report 3 significant digits
library(tidyverse)
library(titanic)
titanic <- titanic_train %>%
select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
mutate(Survived = factor(Survived),
Pclass = factor(Pclass),
Sex = factor(Sex))
Now I wanted to make a barplot that shows the three different Pclasses on the x-axis, and per Pclass a side-by-side bar for women and men. Kinda like this:
titanic %>%
ggplot(aes(x=Pclass, fill=Sex)) +
geom_bar(position=position_dodge())
But instead of the count of men and women in those classes, I want it to display the percentage of them that survived. I've been messing around with this for a while now and can't get it to work. The closest I got was to add this to the aes:
y=(titanic$Survived/(..count..))*100
But this takes the total number of survivors and divides them by the filtered passengers. What I need is the survivors per category (class 1, female / class 1, male / class 2, female / etc.) and divide this by the "..count..". Is there a way to do this? Please let me know.
There are several options to achieve your desired result.
library(tidyverse)
library(titanic)
library(GGally)
Option 1: Use GGally::stat_prop
If you are looking for an out-of-the-box option, you could use GGally::stat_prop, which adds a by aesthetic that sets the denominator for computing proportions.
titanic %>%
ggplot(aes(x = Pclass, y = after_stat(prop), fill = Sex, by = Pclass)) +
geom_bar(position = position_dodge(), stat = "prop") +
geom_text(aes(label = round(100 * after_stat(prop))),
position = position_dodge(.9), stat = "prop", vjust = -.2
)
Option 2: Compute the percentages manually
A second and easy option would be to compute the counts and the proportions manually outside of ggplot().
titanic %>%
count(Pclass, Sex) %>%
group_by(Pclass) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = Pclass, y = prop, fill = Sex)) +
geom_col(position = position_dodge()) +
geom_text(aes(label = round(100 * prop)),
position = position_dodge(.9), vjust = -.2
)
Option 3: Compute the percentages on the fly
Finally, the third option is to compute the percentages on the fly. This requires computing the percentages per x axis category, for which I use ave(). In some sense this is an implementation of what GGally::stat_prop offers out of the box:
titanic %>%
ggplot(aes(x = Pclass, y = after_stat(ave(count, x, FUN = function(x) x / sum(x))), fill = Sex)) +
geom_bar(position = position_dodge()) +
geom_text(aes(label = after_stat(round(100 * ave(count, x, FUN = function(x) x / sum(x))))),
position = position_dodge(.9), vjust = -.2, stat = "count"
)
UPDATE To get the proportion who survived I would go for option 2, i.e. compute the counts and percentages manually:
library(tidyverse)
library(titanic)
titanic %>%
count(Pclass, Sex, Survived) %>%
group_by(Sex, Pclass) %>%
mutate(prop = n / sum(n)) %>%
filter(Survived == 1) %>%
ggplot(aes(x = Pclass, y = prop, fill = Sex)) +
geom_col(position = position_dodge()) +
geom_text(aes(label = round(100 * prop)),
position = position_dodge(.9), vjust = -.2
) +
labs(y = "% Survived")
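The grouping logic of the update can be checked on a tiny made-up table without the titanic package. This is only a sketch; the toy data frame and its values are invented for illustration:

```r
library(dplyr)

# Invented passengers: class 1 has no surviving men.
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 2, 2),
  Sex = c("female", "female", "male", "male", "female", "male"),
  Survived = c(1, 0, 0, 0, 1, 1)
)

survival <- toy %>%
  count(Pclass, Sex, Survived) %>%
  group_by(Sex, Pclass) %>%
  mutate(prop = n / sum(n)) %>%   # share of each outcome within Sex x Pclass
  filter(Survived == 1) %>%
  ungroup()

print(survival)
```

One thing to keep in mind: a Sex and Pclass combination with no survivors has no Survived == 1 row, so it drops out after the filter instead of showing up as a 0% bar.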

Is it possible to adjust a second graph to a second y-axis in ggplot?

I am trying to make several bar plots with their standard errors added to the plot. Adding a second y-axis was not that hard; however, I now also want my standard errors to fit this new y-axis. I know that I can manipulate the y-axis, but that is not really what I want: I want the standard errors drawn against the new y-axis. To illustrate, this is the plot I have now, where I just divided the first y-axis by 100.
What I want is something more like this, shown for all bar plots (this was done for the first bar plot using Excel). Here is my code:
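library(tidyverse)
library(reshape2) # assumed source of melt(); the original post does not show its library() calls
df_bar <- as.data.frame(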
rbind(
c('g1', 0.945131015, 1.083188828, 1.040164338,
1.115716593, 0.947886795),
c('g2', 1.393211286, 1.264193745, 1.463434395,
1.298126006, 1.112718796),
c('g3', 1.509976099, 1.450923745, 1.455102201,
1.280102338, 1.462689245),
c('g4', 1.591697668, 1.326292649, 1.767207296,
1.623619341, 2.528108183),
c('g5', 2.625114848, 2.164050167, 2.092843287,
2.301950359, 2.352736806)
)
)
colnames(df_bar)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_bar <- melt(df_bar, id.vars = "interval",
variable.name = "name",
value.name = "value")
df_line <- as.data.frame(
rbind(
c('g1', 0.0212972, 0.0164494, 0.0188898, 0.01888982,
0.03035883),
c('g2', 0.0195600, 0.0163811, 0.0188747, 0.01887467,
0.03548092),
c('g3', 0.0192249, 0.0161914, 0.02215852, 0.02267605,
0.03426538),
c('g4', 0.0187961, 0.0180842, 0.01962371, 0.02103450,
0.03902890),
c('g5', 0.0209987, 0.0164596, 0.01838280, 0.02282300,
0.03516818)
)
)
colnames(df_line)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_line <- melt(df_line, id.vars = "interval",
variable.name = "name",
value.name = "sd")
df <- inner_join(df_bar,df_line, by=c("interval", "name"))
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
geom_line(aes(x = interval, y = sd, group = 1),
color = "black", size = .75) +
scale_y_continuous("Value", sec.axis = sec_axis(~ . /100, name = "sd")) +
facet_grid(~name, scales = "free") +
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
Thanks in advance..
As described in this example, you also have to transform your sd values to match the scale of your second axis. In your example you divided by 100, so you have to multiply your sd by 100, as shown below:
library(tidyverse)
library(data.table)
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
scale_y_continuous("Value", sec.axis = sec_axis(~ ./100, name = "sd"))+
geom_line(aes(x = interval, y = sd*100, group = 1),
color = "black", size = .75)+
facet_grid(~name, scales = "free")+
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
You can also use a different value to scale your second axis. In this example I used 50 as a scaling factor, which in my opinion looks a bit better:
Created on 2022-08-25 with reprex v2.0.2
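The same idea with the scaling factor pulled out into a variable, so that the line and the secondary axis always use matching transformations. This is a minimal sketch on made-up values; toy and scale_factor are illustrative names, and 50 is just the factor mentioned above:

```r
library(ggplot2)

# Made-up values standing in for one facet of the question's data.
toy <- data.frame(
  interval = factor(c("g1", "g2", "g3")),
  value = c(0.95, 1.39, 1.51),
  sd = c(0.021, 0.020, 0.019)
)

scale_factor <- 50

p <- ggplot(toy, aes(x = interval)) +
  geom_col(aes(y = value, fill = interval)) +
  # Stretch sd onto the primary axis ...
  geom_line(aes(y = sd * scale_factor, group = 1), color = "black") +
  # ... and undo the stretch when labelling the secondary axis.
  scale_y_continuous("Value", sec.axis = sec_axis(~ . / scale_factor, name = "sd")) +
  theme_bw() +
  theme(legend.position = "none")

print(p)
```

Whatever factor you pick, the multiplication in geom_line() and the division in sec_axis() have to be inverses of each other, otherwise the secondary axis labels no longer describe the line.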

Adding cumulative quantities to a geom_bar plots drawn with facet_wrap

Newbie here! After a long search I still could not find a satisfying solution to my problem. I have a dataset of heart failure clinical records (https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records) and I would like to display a series of bar plots where "Survived" and "Dead" are counted per category (i.e. sex, smoking and so on).
I think I have done a decent job of preparing the plots, and they look right to me. The problem is that it is difficult to see the ratio between surviving and dying patients with different characteristics.
I have two ideas, but both of them elude me:
Put a count on top of every bar so that the ratio becomes obvious
Directly show the ratio for every characteristic.
Here is the code I wrote.
library(ggplot2)
library(dplyr) # for %>% and select()
library(tidyr) # for gather()
heart_faliure_data <- read.csv(file = "heart_failure_clinical_records_dataset.csv", header = FALSE, skip=1)
#Prepare Column Names
c_names <- c("Age",
"Anaemia",
"creatinine_phosphokinase",
"diabetes",
"ejection_fraction",
"high_blood_pressure",
"platelets",
"serum_creatinine",
"serum_sodium",
"sex",
"smoking",
"time",
"DEATH_EVENT")
#Apply column names to the dataframe
colnames(heart_faliure_data) <- c_names
# Some Classes like sex, Anaemia, diabetes, high_blood_pressure smoking and DEATH_EVENT are booleans
# (see description of Dataset) and should be transformed into factors
heart_faliure_data$sex <- factor(heart_faliure_data$sex,
levels=c(0,1),
labels=c("Female","Male"))
heart_faliure_data$smoking <- factor(heart_faliure_data$smoking,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$DEATH_EVENT <- factor(heart_faliure_data$DEATH_EVENT,
levels=c(0,1),
labels=c("Survived","Died"))
heart_faliure_data$high_blood_pressure <- factor(heart_faliure_data$high_blood_pressure,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$Anaemia <- factor(heart_faliure_data$Anaemia,
levels=c(0,1),
labels=c("No","Yes"))
heart_faliure_data$diabetes <- factor(heart_faliure_data$diabetes,
levels=c(0,1),
labels=c("No","Yes"))
# Adjust Age to a int value
heart_faliure_data$Age <- as.integer(heart_faliure_data$Age)
# selecting the categorical variables and study the effect of each variable on death-event
categorical.heart_failure <- heart_faliure_data %>%
select(Anaemia,
diabetes,
high_blood_pressure,
sex,
smoking,
DEATH_EVENT) %>%
gather(key = "key", value = "value", -DEATH_EVENT)
#Visualizing this effect with a grouped barplot
categorical.heart_failure %>%
ggplot(aes(value)) +
geom_bar(aes(x = value,
fill = DEATH_EVENT),
alpha = .2,
position = "dodge",
color = "black",
width = .7,
stat = "count") +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead"))
And here is a (not so bad) image of the result:
The goal would be to have some numerical value on top of the bars, or even just a y-axis indication...
I would be glad about any help you can give me!
What about something like this. To make it work, I aggregated the data first:
tmp <- categorical.heart_failure %>%
group_by(DEATH_EVENT, key, value) %>%
summarise(n = n())
#Visualizing this effect with a grouped barplot
tmp %>%
ggplot(aes(x = value, y=n)) +
geom_bar(aes(fill = DEATH_EVENT),
alpha = .2,
position = position_dodge(width=1),
color = "black",
width = .7,
stat = "identity") +
geom_text(aes(x=value, y=n*1.1, label = n, group=DEATH_EVENT), position = position_dodge(width=1), vjust=0) +
labs(x = "",
y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key,
scales = "free",
nrow = 4) +
scale_fill_manual(values = c("#FFA500", "#0000FF"),
name = "Death Event",
labels = c("Survived", "Dead")) +
coord_cartesian(ylim=c(0, max(tmp$n)*1.25))
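If you would rather show the ratio directly (the second idea in the question), one possible extension is to compute each outcome's share within a key and value pair and put it into the label. This is a sketch on made-up counts in the same shape as tmp; toy_counts, share and label are illustrative names:

```r
library(dplyr)

# Made-up counts in the shape of tmp (not the real dataset values).
toy_counts <- data.frame(
  key = "sex",
  value = c("Female", "Female", "Male", "Male"),
  DEATH_EVENT = c("Survived", "Died", "Survived", "Died"),
  n = c(70, 30, 120, 80)
)

toy_ratio <- toy_counts %>%
  group_by(key, value) %>%
  mutate(
    share = n / sum(n),                                  # share within one characteristic
    label = paste0(n, " (", round(100 * share), "%)")    # count plus percentage
  ) %>%
  ungroup()

print(toy_ratio)
```

The label column could then be mapped in geom_text(aes(label = label, ...)) in place of label = n.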

ggplot2 - a custom histogram with a rug plot

I am trying to create a custom histogram with a rug plot showing the original values on the X axis.
I am going to use the mtcars dataset to illustrate. It's not the best dataset for this question, but hopefully the reader will understand what I am trying to achieve.
Below is the basic histogram, without any rug plot attempt.
I want to create the histogram using geom_bar as this allows for more flexibility with custom bins.
I also want a small gap between the histogram bars (i.e. width = 0.95), which adds to this problem's complexity.
library(dplyr)
library(ggplot2)
# create custom bins
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)
mtcars$bin <- cut(mtcars$mpg, breaks = vct_seq)
# create data.frame for the ggplot graph..using bins above
df_mtcars_count <- mtcars %>% group_by(bin) %>% summarise(count = n())
# indicative labels
vct_labels <- c("bin 1", "bin 2", "bin 3", "bin 4")
# attempt 1 - basic plot -- no rug plot
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p
Next, try and add a basic rug plot on the X axis. This obviously doesn't work as the geom_bar and geom_rug have completely different scales.
# attempt 2 with no scaling.... doesn't work as x scale for ordinal (bins) and
# x scale for continuous (mpg) do not match
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg), inherit.aes = F, alpha = 0.3)
p
Now, try and rescale the mpg column to match with the ordinal scale....
First define a linear mapping function...
fn_linear_map <- function(vct_existing_val, vct_new_range) {
# example....converts 1:20 into the range 1 to 10 like this:
# fn_linear_map(1:20, c(1, 10))
fn_r_diff <- function(x) x %>% range() %>% diff()
flt_ratio <- fn_r_diff(vct_new_range) / fn_r_diff(vct_existing_val)
vct_old_min_offset <- vct_existing_val - min(vct_existing_val)
vct_new_range_val <- (vct_old_min_offset * flt_ratio) + min(vct_new_range)
return(vct_new_range_val)
}
Now apply the function. We try to map mpg to the range 1 to 4 (which is an attempt to match the ordinal scale):
mtcars$mpg_remap <- fn_linear_map(mtcars$mpg, c(1, 4))
Try the plot again.... getting closer ... but not really accurate...
# attempt 3: getting closer but doesn't really match the ordinal scale
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg_remap), inherit.aes = F, alpha = 0.3)
p
The graph above is getting close to what I want, but the rug plot does not line up with the actual data. For example, the maximum observation (33.9) should be displayed almost aligned with the right-hand side of the last bar; see below:
mtcars %>% filter(bin == "(25,34]") %>% arrange(mpg) %>% dplyr::select(mpg, mpg_remap)
Your scale makes no sense to me, as you are showing a bin that is twice as wide using the same bar width. Doing that in combination with a rug strikes me as confusing at best and misleading at worst. I suggest you plot the bars with their correct widths, after which the rug is trivial.
I think the best solution is to just use geom_histogram:
ggplot(mtcars, aes(mpg)) +
geom_histogram(breaks = vct_seq, col = 'grey80') +
geom_rug(aes(mpg, y = NULL))
If you really want the gaps between the bars you'll have to do more work:
library(tidyr)
d <- mtcars %>%
count(bin) %>%
separate(bin, c('min', 'max'), sep = ',', remove = FALSE) %>%
mutate_at(vars('min', 'max'), readr::parse_number) %>%
mutate(
middle = min + (max - min) / 2,
width = 0.9 * (max - min)
)
ggplot(d, aes(middle, n)) +
geom_col(width = d$width) +
geom_rug(aes(mpg, y = NULL), mtcars)
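As a variation on the same idea, the bin edges can also be taken straight from the breaks vector instead of being parsed back out of the factor labels. This is only a sketch of that alternative, using base R for the bin table; the name bins is illustrative:

```r
library(ggplot2)

# Same custom breaks as in the question.
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)

# One row per bin: lower edge, upper edge, count, midpoint and drawn width.
bins <- data.frame(
  min = head(vct_seq, -1),
  max = tail(vct_seq, -1)
)
bins$n <- as.vector(table(cut(mtcars$mpg, breaks = vct_seq)))
bins$middle <- (bins$min + bins$max) / 2
bins$width <- 0.9 * (bins$max - bins$min)   # 10% of each bin is left as a gap

p <- ggplot(bins, aes(middle, n)) +
  geom_col(width = bins$width) +
  geom_rug(data = mtcars, aes(x = mpg), inherit.aes = FALSE, alpha = 0.3)

print(p)
```

Since the bars sit on the continuous mpg scale, the rug needs no remapping and lines up with the data by construction.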
