I am new to R & ggplot2 and wondering if it is possible to do a population pyramid for Male & Female and comparing each of the gender across two time periods. Please see the screenshot for the details. I have found quite a few resources on this site that show how to build population pyraminds but they all use only one variable i.e. gender. I want to compare gender and time periods in the same chart.
Any help is greatly appreciated. Thank you.
Here is an idea. First you didn't prepare an example dataset. Therefore I created this df. Note that the Values (number of people) are negative for women.
df <- data.frame(Gender = rep(c("M", "F"), each = 20),
Age = rep(c("0-10", "11-20", "21-30", "31-40", "41-50",
"51-60", "61-70", "71-80", "81-90", "91-100"), 4),
Year = factor(rep(c(2009, 2010, 2009, 2010), each= 10)),
Value = sample(seq(50, 100, 5), 40, replace = TRUE)) %>%
mutate(Value = ifelse(Gender == "F", Value *-1 , Value))
Next step is to everything in a bar plot. The function interaction helps to color the bars by Gender and Year. In scale_fill_manual the color can be specified. Alternativly you can use fill = Gender and alpha = Year if you don't want to use the interaction.
ggplot(df) +
geom_col(aes(fill = interaction(Gender, Year, sep = "-"),
y = Value,
x = Age),
position = "dodge") +
scale_y_continuous(labels = abs,
expand = c(0, 0)) +
scale_fill_manual(values = hcl(h = c(15,195,15,195),
c = 100,
l = 65,
alpha=c(0.4,0.4,1,1)),
name = "") +
coord_flip() +
facet_wrap(.~ Gender,
scale = "free_x",
strip.position = "bottom") +
theme_minimal() +
theme(legend.position = "bottom",
panel.spacing.x = unit(0, "pt"),
strip.background = element_rect(colour = "black"))
Related
For this week's TidyTuesday challenge on the Erasmus dataset, I was trying to replicate a visualization that seemed cool to me. Yet, I had faced an issue with the ordering of columns in a bar plot and I specifically wanted to ask about a nuance that I could not grasp, although I solved it.
In this first visualization, the coloring of bars was based on their order with regard to their letters.
In the second one, the coloring was based on their order with regard to their total participants.
This all changed thanks to the following code when cleaning data.
mutate(country_name = factor(country_name, levels = rev(country_name)))
Here is also the whole code chunk;
filter (receiving_country_code == "TR") %>%
filter (sending_country_code != receiving_country_code) %>%
group_by(sending_country_code) %>%
summarize(total = sum(participants)) %>%
arrange(desc(total)) %>%
filter(total>9) %>%
left_join(country_codes, by = c("sending_country_code" = "iso2c")) %>%
mutate(country_name = case_when(
sending_country_code == "UK" ~ "United Kingdom",
sending_country_code == "EL" ~ "Greece",
TRUE ~ country_name
)) %>%
mutate(country_name = factor(country_name, levels = rev(country_name)))
p1 <- ggplot(data = top_incoming,
aes(x = country_name, y = total, fill = country_name)) +
geom_bar(width = 0.3, stat = "identity", show.legend = FALSE) +
geom_text(aes(x = country_name, y = 0, label = paste0(country_name, " - ", total)),
hjust = 1.05, size = 2, colour = rep(c("#173f5f", "#3caea3"), 10)) +
coord_polar(theta = "y", start = 0) +
ylim(c(0, 40)) +
scale_fill_manual(values = rep(c("#3caea3", "#173f5f"), 10)) +
labs(title = "Incoming Students",
subtitle = "2015-2019") +
theme_void() +
theme (plot.title = element_text(hjust = 0.5, family = "subtitle", size = 14),
plot.subtitle = element_text(hjust = 0.5, family = "subtitle", size = 12),
plot.caption = element_text(family = "caption"))
This all changed thanks to the following code when cleaning data.
Although I figured out what this code's function is, I could not understand its underlying logic, particularly what levels = rev(country_name) means considering that the order of the columns was primarily based on character level and, at best, I think, the rev() function should have reversed it instead of changing the order to be based on participants.
I have created the following plot:
From a bigger version (5 rows, 58 columns) of this df:
df <- data.frame(row.names = c("ROBERT", "FRANK", "MICHELLE", "KATE"), `1` = c(31, 87, 22, 12), `2` = c(37, 74, 33, 20), `3` = c(35, 32, 44, 14))
colnames(df) <- c("1", "2", "3")
In the following manner:
df = df %>%
rownames_to_column("Name") %>%
as.data.frame()
df <- melt(df , id.vars = 'Name', variable.name = 'ep')
ggplot(df, aes(ep,value)) + geom_line(aes(colour = Name, group=Name))
The plot kind of shows what I'd like to, but it really is a mess. Does anyone have a suggestion that would help me increasing its readability?
Any help is very much appreciated!
Here are a few options for visualizing lots of datapoints across a smallish number of cases. These are illustrated with a subset of the txhousing data included with ggplot2.
Solution 1: Faceting
As #rdelrossi suggested, one solution is to facet by Name:
library(ggplot2)
ggplot(df, aes(ep,value)) +
geom_line(aes(colour = Name, group=Name), show.legend = FALSE) +
scale_x_continuous(expand = c(0,0)) +
facet_wrap(vars(Name), ncol = 1, scales = "free_x") +
theme_bw()
Solution 2: Smoothing
Use geom_smooth() to smooth out local fluctuations to see larger longer-term trends:
ggplot(df, aes(ep,value)) +
geom_smooth(
aes(colour = Name, group=Name),
se = FALSE,
span = 1, # higher number = smoother
size = 1.25
) +
scale_x_date(expand = c(0,0)) +
theme_bw()
Solution 3: Lasagna
Sometimes called a "lasagna plot," this is a heatmap with cases on the y axis, time (or whatever) on the x axis, and values mapped to color. It's a different way of comparing changes within (left to right) and between (up and down) individuals.
ggplot(df, aes(ep, Name, colour = value, fill = value)) +
geom_tile(size = .5) +
scale_fill_viridis_c(option = "B", aesthetics = c("colour", "fill")) +
coord_cartesian(expand = FALSE) +
theme(
axis.text.y = element_text(size = 12, face = "bold"),
axis.title.y = element_blank()
)
(may want to click through to larger image)
Data prep:
library(dplyr)
library(lubridate)
df <- txhousing %>%
filter(
city %in% c("Beaumont", "Amarillo", "Arlington", "Corpus Christi", "El Paso"),
between(year, 2004, 2012)
) %>%
group_by(city) %>%
mutate(
Name = city,
value = scale(sales),
ep = ym(str_c(year, month))
) %>%
ungroup()
If your readability concern is just the x axis labels, then I think the main issue is that when you use reshape2::melt() the result is that the column ep is a factor which means that the x axis of your plot will show all the levels and get crowded. The solution is to convert it to numeric and then it will adjust the labels in a sensible way.
I replace your use of reshape2::melt() with tidyr::pivot_longer() which has superseded it within the {tidyverse} but your original code would still work.
library(tidyverse)
df <- structure(list(`1` = c(31, 87, 22, 12), `2` = c(37, 74, 33, 20), `3` = c(35, 32, 44, 14)), class = "data.frame", row.names = c("ROBERT", "FRANK", "MICHELLE", "KATE"))
df %>%
rownames_to_column("Name") %>%
pivot_longer(-Name, names_to = "ep") %>%
mutate(ep = as.numeric(ep)) %>%
ggplot(aes(ep, value, color = Name)) +
geom_line()
Created on 2022-03-07 by the reprex package (v2.0.1)
Another solution could be the use of a geom_bar()
Sample code:
ggplot(df, aes(fill=Name)) +
geom_bar(aes(x=ep, y=value, group=Name),stat="identity", position = position_dodge(width = 0.9)) +
labs(x="ep", y="count")+
scale_y_continuous(expand=c(0,0))+
theme_bw()
Plot:
Also you can add facet_grid(~Name)+
Also you can add
geom_text(aes(label=value), position = position_stack(vjust = .5))+
I have a post-election dataset where I have asked respondents to rate the tone of the campaign of six candidates (scale 0-10). For each candidate, I want to assess whether ratings of campaign tone differ depending on party identification. I basically have 6 scale variables for campaign tone ratings, one for each leader, and 6 dummy variables for party id, one for each party.
I have created 6 plots, showing mean differences based on party ID for each candidate.
This is the example of the code for each candidate1:
plotR <- data2 %>%
group_by(ptID_R) %>%
summarise(group_mean = mean(meanIncR, na.rm = TRUE), se = sd(meanIncR, na.rm = TRUE) / sqrt(length(meanIncR))) %>%
ggplot(aes(x = group_mean, y = ptID_R, color = ptID_R)) +
geom_point(size = 2) +
geom_errorbarh(aes(xmin = group_mean - se, xmax = group_mean + se), height= 0.05) +
labs(
title = "Mark Rutte***",
x = "Incivility (Mean)" ,
y = "Party ID (1 = VVD)",
color = ""
) +
theme_minimal() +
theme(
text = element_text(size=11),
plot.title = element_text(face = "bold", size=12, hjust = 0.5),
axis.title = element_text(size=11),
plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))
I have arranged the individual charts into one plot using grid.arrange2:
plotAll <- grid.arrange(plotW, plotKl, plotR, plotKa, plotH, plotP,
nrow = 2, widths = c(1, 1, 1),
top = textGrob("Perceptions of Performed Incivility by Party ID",
gp = gpar(fontface = "bold", fontsize = 12))
However, I would like to show this a little differently. I want to plot all 6 chart together, but in a way that on the y axis I have all leaders, and on the x axis campaign ratings. It should look something like in the picture here linked 3
Can anybody help me with this problem? I don't really know how to go about it.
Unfortunately, I cannot share the dataset I am working with, but maybe you can still give me an idea on how to approach this.
Many thanks in advance!
Chiara
I think the option you might be looking for is position = position_dodge().
library(ggplot2)
# Make up some dummy data
df <- expand.grid(
leader = LETTERS[1:6],
party = factor(c(0, 1))
)
# Dummy scores
set.seed(0)
df$mean <- runif(nrow(df), 0, 10)
df$sd <- runif(nrow(df))
# Use position = position_dodge
ggplot(df, aes(y = leader, colour = party)) +
geom_point(aes(x = mean),
position = position_dodge(width = 0.8)) +
geom_errorbarh(aes(xmin = mean - sd, xmax = mean + sd),
position = position_dodge(width = 0.8))
Created on 2021-07-07 by the reprex package (v0.3.0)
I am trying to create a plot to track results over days for multiple factors. Ideally I would like my xaxis to be Day, with the day number centered in the middle of the reps for that particular day, the y axis to be result, and the facet will be the Lot (1-4). I am having difficulty making the day centered on the bottom using repeatable text, as the number of reps may vary.
I was using ideas shown in this post: Multi-row x-axis labels in ggplot line chart but have been unable to make any progress.
Here is some code I have been using and the plot that I have so far. The x axis is far too busy and I am trying to consolidate it.
data <- data.frame(System = rep(c("A", "B"), each = 120), Lot = rep(1:4, each = 30),
Day = rep(1:5, each = 6), Rep = rep(1:6, 40), Result = rnorm(240))
library(ggplot2)
ggplot(data, aes(x = interaction(Day, Rep, lex.order = TRUE), y = Result, color = System, group = System)) +
geom_point() +
geom_line() +
theme(legend.position = "bottom") +
facet_wrap(~Lot, ncol = 1) +
geom_vline(xintercept = (which(data$Rep == 1 & data$Day != 1)), color = "gray60")
I'm not 100% sure if this is exactly what you are after but this will center the day on the x-axis.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(System = rep(c("A", "B"), each = 120), Lot = rep(1:4, each = 30),
Day = rep(1:5, each = 6), Rep = rep(1:6, 40), Result = rnorm(240))
df <- df %>%
unite(Day_Rep, Day, Rep, sep = ".", remove = F) %>%
mutate(Day_Rep = as.numeric(Day_Rep))
ggplot(df, aes(x = Day_Rep, y = Result, color = System, group = System)) +
geom_point() +
geom_line() +
theme(legend.position = "bottom") +
facet_wrap(~Lot, ncol = 1) +
scale_x_continuous(labels = df$Day, breaks = df$Day + 0.5)+
geom_vline(xintercept = setdiff(unique(df$Day), 1))
I have plotted two facets one on top of the other with two different ys (a percentage and a cost) and the same x (Years). I took most of the ideas from this post and some variations of the same.
I'd like to show the labels of the y axis as percentages for the rate and as £ for the costs, but I have been unable to change each y label format independently.
Below a reproducible example using facet_grid (I managed to create a similar thing with facet_wrap but I get stuck with the same problem).
I considered using grid.arrange() from the gridExtra package, but it seemed that would bring other issues with the legend.
library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(scales)
set.seed(12345)
my_labels <- function(variable, value){
names_li <- list("percentage", "cost in pounds")
return(names_li[value])
}
df <- data.frame(
rate = runif(10, 0, 1),
cost = rnorm(10, 100, 40),
years = seq(from = 2001, to = 2010)
)
df %>%
gather(type_of_var,
value,
rate:cost) ->
df2
df2 %>%
ggplot(aes(x = years,
y = value,
ymin = 0,
ymax = .1)) +
facet_grid(type_of_var ~ .,
scales = 'free_y',
labeller = my_labels) +
labs(x = "Year",
y = "") +
geom_point(subset = . (type_of_var == "rate")) +
geom_line(subset = . (type_of_var == "rate"),
colour = "grey") +
## the following two lines don't work
# scale_y_continuous(subset = . (type_of_var == "rate"),
# labels = percent) +
geom_bar(subset = . (type_of_var == "cost"),
stat = "identity") +
theme_bw() +
theme(strip.text.y = element_text(size = 15,
colour = "black"),
plot.title = element_text(lineheight = 0.8,
face = "bold")) +
scale_x_continuous(breaks = seq(2001, 2010, 1)) +
labs(title = "free_y y axis labels")
Thanks
as a fragile workaround, you could use
label_fun <- function (x) {
if(max(x, na.rm=TRUE) > 1) dollar(x) else percent(x)
}
(assuming you only deal with big money and small percentages)