Removing NA category from grouped bar charts - r

I am currently working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector)
I have created a function that loops through all 250 columns and creates grouped bar charts where the x axis shows the sectors, the y axis shows the percentage distribution of answers, and the fill is the underlying column from data. Below is the code for the function:
by_sector <- lapply(names(data), function(variable) {
ggplot(
data = data,
mapping = aes(x=sector,fill = data[[variable]])
) +
geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge") +
labs(x = variable, y = "% of total", fill = "Response", caption = paste("Total =", sum(!is.na(data[[variable]])))) +
geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..],accuracy = 0.1) ),
stat="count", position=position_dodge(1), vjust=0.5)+
#scale_fill_brewer(palette = "Accent")+
scale_fill_discrete(na.translate = FALSE) +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
})
As you can see from image below, since I use data columns as fill, there is transparent NA category showing up. I want to remove that category from grouped bars.
I tried a couple of things:
scale_fill_discrete(na.translate = FALSE): this only removed NA from the legend, not from the grouped bars.
fill = subset(data, !is.na(data[[variable]])): this didn't work.
ggplot(data = na.omit(data[[variable]])): this didn't work either.
Is there a way to modify my bar plot code so that the NA category doesn't show up as a bar in the graph? Thank you very much in advance!

One option would be to aggregate your data outside of ggplot(). This makes the code easier to debug, removes the duplicated computations inside the plotting code, and makes it easy to drop the NA categories if desired.
Additionally, I moved the plotting code to a separate function, which also allows for easier debugging, e.g. by running the code for just one example.
Finally, note that I switched to the .data pronoun as the recommended way to use column names passed as strings.
Showing only the plots for two of the problematic columns:
EDIT Fixed a small bug by removing the NA values before aggregating instead of doing that afterwards.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(x = sector, y = pct, fill = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
x = variable, y = "% of total", fill = "Response",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_fill_brewer(palette = "Accent") +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
}
by_sector <- lapply(names(data), plot_fun)
by_sector[c(3, 6)]
#> [[1]]
#>
#> [[2]]
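To check the aggregation step on its own, without any plotting, something like the sketch below may help. The helper name summarise_by_sector and the toy data frame are my own additions for illustration; the pipeline itself is the one used inside plot_fun above.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data copied from the question (only the two columns needed here)
toy <- data.frame(
  q3 = factor(c(0, 1, NA, 0, 1, 1, NA, 0)),
  sector = factor(c("Energy", "Water", "Energy", "Other",
                    "Other", "Water", "Transportation", "Energy"))
)

# Hypothetical helper: the aggregation from plot_fun, without the plot
summarise_by_sector <- function(df, variable) {
  df |>
    filter(!is.na(.data[[variable]])) |>                  # drop NA answers first
    group_by(across(all_of(c("sector", variable)))) |>
    summarise(n = n(), .groups = "drop_last") |>          # still grouped by sector
    mutate(pct = n / sum(n)) |>                           # share within each sector
    ungroup()
}

agg <- summarise_by_sector(toy, "q3")
agg
```

Because the NA rows are filtered out before counting, the shares within each remaining sector add up to 1, and a sector whose answers are all NA simply has no rows.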

Related

Adding weights to barplot function in ggplot2

I am working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
weights <- factor(c(0.13,0.25,0.13,0.22,0.22,0.25,0.4,0.13))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector,weights)
With help from Stack Overflow I have created the following function, which loops through the columns and creates bar charts where the x axis shows the percentage of responses, the y axis shows the underlying column, and the fill is the sector.
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(fill = sector, x = pct, y = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
y = variable, x = "Percentage of responses", fill = "Sector legend",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_x_continuous(labels=function(x) paste0(x*100))+
scale_fill_brewer(palette = "Accent")+
theme_bw() +
theme(panel.grid.major.y = element_blank())
}
Now I want to apply survey weights so that the bar charts show weighted response percentages. I have tried adding weight = data$weights to the mapping, but it didn't work. I have also tried applying the weights in the calculation of percentages with summarise(n = sum(weights)), but that didn't work either.
Is there a way to modify my code so that the weights are applied? Thank you in advance.
It's still not clear how you want to apply the weights. I've assumed here that you want to multiply the percentage by the weight. Note that you need to fix your data: weights should not be a factor if you want to use it as a numerical value in a calculation. Anyhow, I used weights in the group_by() so that they carry through, and then in mutate() to create a weighted percentage.
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", "weights", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n), wpct = pct*weights) |>
ungroup()
ggplot(
data = data,
mapping = aes(fill = sector, x = wpct, y = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
y = variable, x = "Percentage of responses", fill = "Sector legend",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(wpct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_x_continuous(labels=function(x) paste0(x*100))+
scale_fill_brewer(palette = "Accent")+
theme_bw() +
theme(panel.grid.major.y = element_blank())
}
If this doesn't do the trick, do clarify how you look to use the weights and what the final outcome values should be.
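If the goal is instead that each respondent counts in proportion to their weight, a minimal sketch could look like the one below. This is only one possible reading of "weighted percentages", not something confirmed in the question. The toy data frame and the column names w, w_n and wpct are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data copied from the question, with weights stored as a factor as posted
toy <- data.frame(
  q1 = factor(c("yes", NA, "no", "yes", NA, "yes", "no", "yes")),
  sector = factor(c("Energy", "Water", "Energy", "Other",
                    "Other", "Water", "Transportation", "Energy")),
  weights = factor(c(0.13, 0.25, 0.13, 0.22, 0.22, 0.25, 0.4, 0.13))
)

weighted_pct <- toy |>
  # as.numeric() on a factor returns the level codes, so convert via character
  mutate(w = as.numeric(as.character(weights))) |>
  filter(!is.na(q1)) |>
  group_by(sector, q1) |>
  summarise(w_n = sum(w), .groups = "drop_last") |>   # sum of weights per answer
  mutate(wpct = w_n / sum(w_n)) |>                    # weighted share within sector
  ungroup()

weighted_pct
```

In the sample data the weights are constant within each sector, so the weighted shares equal the unweighted ones; a difference only appears once weights vary between respondents of the same sector. The factor conversion is also a likely reason summarise(n = sum(weights)) failed, since sum() is not defined for factors.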

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
p <- ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)
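A variant of the summary data frame approach is to hand the labels to geom_text() through its data argument instead of annotate(), so the x positions come from the Group column rather than from hardcoded 1:2. This is just a sketch; to keep it free of the Hmisc dependency it uses a plain mean in place of mean_cl_boot, and the name p_counts is arbitrary.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

set.seed(123)
df_x <- data.frame(
  Group = c(rep("A", 1000), rep("B", 2)),
  Value = rnorm(1002)
)

# One row per group, so exactly one label per group
df_text <- df_x %>%
  group_by(Group) %>%
  summarise(avg = mean(Value), n = n())

p_counts <- ggplot(df_x, aes(x = Group, y = Value)) +
  # plain mean as a stand-in for mean_cl_boot (which needs Hmisc)
  stat_summary(fun = mean, geom = "point", size = 3) +
  geom_text(
    data = df_text,                         # labels come from the summary table
    aes(y = avg, label = paste0("n=", n)),
    nudge_x = 0.2                           # shift the label off the point
  ) +
  coord_flip() +
  theme_classic()

p_counts
```

Since df_text is recomputed from whatever data is being plotted, this should also cope with dynamic filtering, as long as the summary is rebuilt alongside the plot.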

Problem with naming x-axis with ggplot2 in Rstudio

I'm trying to create some variation of a pareto-chart.
Working through the code, I ran into a problem I haven't been able to solve on my own for several hours. It concerns (1) the order of the data in ggplot2 and (2) relabelling the axis accordingly.
(1) Since I want to create an ordered bar plot with a saturation curve, I created a dummy variable running from 1 to X, so my bars are sorted from high to low, as you can see in the output (1).
By working around this problem I created a second problem I can't fix.
(2) I have a column in my df containing all the species I want to see on the x-axis. However, ggplot won't print them there. In fact, since I added the scale_x_discrete command I don't get any labels on the x-axis at all, and I don't get any error either.
So my question is:
Is there a way to use my species list as x-axis?(But remember my data has to be sorted from high to low)
Or does some one easily spot a way to solve the labeling problem?
cheers
dfb
  Beech       id     proc     kommu Order
1 Va fla       1 8.749851  8.749851 Psocopt
2 Er           2 7.793812 16.543663 Acari
3 Faga dou     3 7.659406 24.203069 Dipt
4 Tro          4 6.675941 30.879010 Acari
5 Hal ann      5 6.289307 37.168317 Dipt
6 Stigm        6 3.724406 40.892723 Acari
7 Di fag       7 3.642574 44.535297 Lepidopt
8 Phyfa        8 3.390545 47.925842 Neoptera
9 Phylma       9 2.766040 50.691881 Lepidopt
data example:
structure(list(Beech = c("Va fla", "Er", "Faga dou", "Tro", "Hal ann",
"Stigm", "Di fag", "Phyfa", "Phylma"), id = c(1, 2, 3, 4, 5,
6, 7, 8, 9), proc = c(8.749851, 7.793812, 7.659406, 6.675941,
6.289307, 3.724406, 3.642574, 3.390545, 2.76604), kommu = c(8.749851,
16.543663, 24.203069, 30.87901, 37.168317, 40.892723, 44.535297,
47.925842, 50.691881), Order = c("Psocopt", "Acari", "Dipt",
"Acari", "Dipt", "Acari", "Lepidopt", "Neoptera", "Lepidopt")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(openxlsx)
library(ggplot2)
dfb <- data.xlsx ###(df containing different % values per species)
labelb <- dfb$Beech ###(list of 22 items; same number as x-values)
p <-ggplot(dfb, aes(x=id))
p <- p + geom_bar(aes(y = proc), stat = "identity", fill = "lightgreen")
p <- p + geom_line(aes(y = kommu/10), color = "orange", size = 2) + geom_point(aes(y = kommu/10),size = 2)
p <- p + scale_y_continuous(sec.axis = sec_axis(~.*10, name ="Total biocoenosis[%]"))
p <- p + labs(y = "Species [%]",
x = "Species")
p <- p + scale_x_discrete(labels = labelb)
p <- p + theme(legend.position = c(0.8, 0.9))
--> Answer to other comments:
So basically my problem is the bars are not labeled with a species name.
I know that this is a result due to my dummyvar, which is basically 1 to 22.
So I try to force ggplot to name the x-axis with my wanted values.
But this input doesn't work
p <- p + scale_x_discrete(labels = labelb)
But back to your suggestions:
Yeah, I tried the tidyverse right after creating this post and couldn't get it to work well enough. But your idea doesn't change anything for me; it behaves just like the plain ggplot command.
arrange(Beech) %>%
mutate(Beech = factor(Beech, levels = unique(.$Beech))) %>%
ggplot(aes(Beech, proc)) +
geom_col()
I can't quite tell from the picture what's going wrong, but one way to make sure your bar plots are in ascending/descending order is to arrange the column and then convert it to a factor using the existing order of the categories:
So, without ordering:
library(tidyverse)
diamonds %>%
group_by(cut) %>%
summarize(price = mean(price)) %>%
ggplot(aes(cut, price)) +
geom_bar(stat = "identity")
And with ordering:
diamonds %>%
group_by(cut) %>%
summarize(price = mean(price)) %>%
arrange(price) %>%
mutate(cut = factor(cut, levels = unique(.$cut))) %>%
ggplot(aes(cut, price)) +
geom_bar(stat = "identity")
I edited your code with the database sample you provided and I think I was able to do what you wanted.
Basically I sorted Beech depending on the descending proc and then convert it to factor. Here is the modified code and the result:
p <-
dfb %>%
arrange(desc(proc)) %>%
mutate(Beech = factor(Beech, levels = unique(.$Beech))) %>%
ggplot(aes(Beech)) +
geom_bar(aes(y = proc), stat = "identity", fill = "lightgreen") +
geom_line(aes(y = kommu/10, x=as.integer(Beech)), color = "orange", size = 2) +
geom_point(aes(y = kommu/10),size = 2) +
labs(y = "Species [%]", x = "Species") +
scale_x_discrete("Species") +
scale_y_continuous(sec.axis = sec_axis(~.*10, name ="Total biocoenosis[%]")) +
theme(legend.position = c(0.8, 0.9))
p
Note: I had to tweak geom_line a bit by adding x = as.integer(Beech), because the line needs numeric x positions rather than factor levels.
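If you would rather keep the numeric id on the x axis, another option might be a continuous scale with one break per id and the species names as labels. This is a sketch under the assumption that id runs in the desired bar order, as in the sample data; the name p_id is arbitrary.

```r
library(ggplot2)

# Sample data from the question (Order column omitted)
dfb <- data.frame(
  Beech = c("Va fla", "Er", "Faga dou", "Tro", "Hal ann",
            "Stigm", "Di fag", "Phyfa", "Phylma"),
  id = 1:9,
  proc = c(8.749851, 7.793812, 7.659406, 6.675941, 6.289307,
           3.724406, 3.642574, 3.390545, 2.76604),
  kommu = c(8.749851, 16.543663, 24.203069, 30.87901, 37.168317,
            40.892723, 44.535297, 47.925842, 50.691881)
)

p_id <- ggplot(dfb, aes(x = id)) +
  geom_col(aes(y = proc), fill = "lightgreen") +
  geom_line(aes(y = kommu / 10), color = "orange") +
  geom_point(aes(y = kommu / 10), size = 2) +
  # id is numeric, so label it through a continuous scale:
  # one break per bar, labelled with the matching species name
  scale_x_continuous(breaks = dfb$id, labels = dfb$Beech) +
  scale_y_continuous(sec.axis = sec_axis(~ . * 10, name = "Total biocoenosis [%]")) +
  labs(x = "Species", y = "Species [%]")

p_id
```

Here the line and points need no extra x mapping, because everything shares the same numeric x.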

Summarizing data into percentages for side-by-side Bar Charts in R

Below is the code I am having trouble with and its output. The data set is linked at the bottom of the post.
What I am wanting to do is group the StateCodes together with each MSN (opposite of what is showing now in the output).
plotdata <- EnergyData %>%
filter(MSN %in% c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB")) %>%
filter(Year %in% c("2009")) %>%
select(StateCode, MSN, Data) %>%
group_by(StateCode) %>%
mutate(pct = Data/sum(Data),
lbl = scales::percent(pct))
plotdata
This outputs to:
I thought that the group_by function would do that for me but I would like to know if I am missing a key chunk of code?
Once the above chunk runs correctly, I want to create side by side Bar charts by StateCode using the percentages of each of the 5 MSN's.
Here's the code I have so far.
ggplot(EnergyData,
aes(x = factor(StateCode,
levels = c("AZ", "CA", "NM", "TX")),
y = pct,
fill = factor(drv,
levels = c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB"),
labels = c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB")))) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
label = pct) +
geom_text(aes(label = lbl),
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "MSN",
x = "State",
title = "Renewable Resources by State") +
theme_minimal()
As of now I believe this all has to do with how I create the percentages for the bar charts.
Any assistance would be great. Thank you!
Here's the data I used Energy Data http://www.mathmodels.org/Problems/2018/MCM-C/ProblemCData.xlsx
Here is a version using data.table for the initial filtering, and changes to the plot function that hopefully get you the result you are after:
library(readxl)
library(data.table)
library(ggplot2)
# mode = "wb" keeps the binary xlsx file intact on Windows
download.file("http://www.mathmodels.org/Problems/2018/MCM-C/ProblemCData.xlsx", "~/ex/ProblemCData.xlsx", mode = "wb")
# by default, factor levels will be in alphabetical order, so we do not need to specify that
EnergyData <- data.table(read_xlsx("~/ex/ProblemCData.xlsx"), key="StateCode", stringsAsFactors = TRUE)
# filter by Year and MSN list
plotdata <- EnergyData[as.character(MSN) %chin% c("BMTCB", "GETCB", "HYTCB", "SOTCB", "WYTCB") & Year == 2009]
# calculate percentages of Data by StateCode
plotdata[, pct := Data/sum(Data), by = "StateCode"]
# plot using percent format and specified number of breaks
ggplot(plotdata,
aes(x = StateCode,
y = pct,
fill = MSN)) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), n.breaks = 6) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "MSN",
x = "State",
title = "Renewable Resources by State") +
theme_minimal()
Created on 2020-03-20 by the reprex package (v0.3.0)
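For anyone staying with dplyr: group_by() only changes how later computations are grouped; it does not reorder the rows. To see the states listed together under each MSN, an explicit arrange() is needed. A small sketch with made-up numbers (toy_energy is not the real data set):

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up values standing in for the filtered 2009 rows of EnergyData
toy_energy <- data.frame(
  StateCode = c("AZ", "CA", "AZ", "CA"),
  MSN = c("BMTCB", "BMTCB", "GETCB", "GETCB"),
  Data = c(10, 30, 30, 10)
)

plotdata <- toy_energy %>%
  group_by(StateCode) %>%                  # percentages are computed per state
  mutate(pct = Data / sum(Data),
         lbl = scales::percent(pct)) %>%
  ungroup() %>%
  arrange(MSN, StateCode)                  # row order: states within each MSN

plotdata
```

The percentages still sum to 1 within each state; only the printed order changes.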

Revising ggplot after a function: non-numeric argument to binary operator error

I am attempting to produce a ggplot from within a function. I can do so using the sample data and code below.
If I produce the plot (p) outside of the function, I can revise it with no problem to add a title, subtitle, axis labels, etc. (e.g., p + labs(title = "Most frequent words, by gender")).
However, if I produce the plot from within the function and then attempt to modify it, I get the following error: non-numeric argument to binary operator.
In both cases, the object "p" shows up under Values.
I would of course like to use a function because I have a number of different group_by variables to test, and I want to eliminate typing mistakes (e.g., forgetting to change "gender" to "income" on a later analysis).
Can someone explain why the error arises only after modifying a ggplot created in a function? And of course I would be grateful for advice about how to eliminate the source of the error.
# sample data of favorite activities
df <- tibble(
word = c("walk","hike","garden","garden","walk","hike", "garden","hike","hike","hike","walk"),
gender = c("Male","Female","Female","Female","Male","Male","Male", "Male","Male","Female","Female")
)
df
# function to figure out the proportions of the activities
sum_text_prop <- function(df, groupbyvar) {
groupbyvar <- enquo(groupbyvar)
df %>%
count(!!groupbyvar, word, sort = TRUE) %>%
group_by(groupbyvar = !!groupbyvar) %>%
mutate(proportion = n / sum(n)) %>%
top_n(proportion, n = 5) %>%
ungroup()
}
# function to plot the most common words
plot_text_prop <- function(df) {
p <- ggplot(data = df, aes(x = word, y = proportion, fill = groupbyvar)) +
geom_bar(stat = "identity", alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ groupbyvar, ncol = 2, scales = "free") +
coord_flip()
print(p)
}
# deploy the functions
df %>%
sum_text_prop(groupbyvar = gender) %>%
plot_text_prop()
# add a title to the plot
p + labs(title = "Most frequent words, by gender")
# error: Error in p + labs(title = "Most frequent words, by gender") :
non-numeric argument to binary operator
Update
Thanks to the helpful responses, my revised code is as follows:
plot_text_prop <- function(df) {
ggplot(data = df, aes(reorder_within(word, proportion, groupbyvar),
proportion, fill = groupbyvar)) +
geom_bar(stat = "identity", alpha = 0.8, show.legend = FALSE) +
scale_x_reordered() +
facet_wrap(~ groupbyvar, ncol = 2, scales = "free") +
coord_flip()
}
p <- tidy_infl %>%
sum_text_prop(groupbyvar = gender) %>%
plot_text_prop()
p + labs(title = "Most frequent words, by gender")
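The key change in the revised code is that the function returns the plot and the result is assigned. In the original version, p was created only inside plot_text_prop() and the pipeline's result was never stored, so the p used at the top level was not the plot built by the function. A stripped-down sketch of the pattern, with made-up data and the hypothetical names plot_props, p_toy and p_titled:

```r
library(ggplot2)

# Made-up proportions, just to have something to plot
toy_props <- data.frame(
  word = c("walk", "hike", "garden"),
  proportion = c(0.2, 0.5, 0.3)
)

# The ggplot object is the last expression, so it is the function's return value
plot_props <- function(df) {
  ggplot(df, aes(x = word, y = proportion)) +
    geom_col() +
    coord_flip()
}

p_toy <- plot_props(toy_props)                           # keep the returned plot
p_titled <- p_toy + labs(title = "Most frequent words")  # now it can be modified
p_titled
```

If printing inside the function is still wanted, ending the function with print(p) and then invisible(p) would be one way to both draw and return the plot.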
