Population pyramid plot in R [duplicate] - r

I am new to R and am trying to create a population pyramid plot similar to the first one at https://klein.uk/teaching/viz/datavis-pyramids/. I have a dataset with two variables, sex and age_group, that looks like this:
sex age_group
1 Male 20-30
2 Female 50-60
3 Male 70-80
4 Male 10-20
5 Female 80-90
... ... ...
This is the code I used
ggplot(data = pyramid_graph(x = age_group, fill = sex)) +
geom_bar(data = subset(pyramid_graph, sex == "F")) +
geom_bar(data = subset(pyramid_graph, sex == "M")) +
mapping = aes(y = - ..count.. ),
position = "identity") +
scale_y_continuous(labels = abs) +
coord_flip()
I do not get any errors from R but when I execute this code a blank image is produced.
Can anyone help?
Thank you

Using a similar input dataset from the same website that you cite in your question:
# Load the packages used below (dplyr for the pipeline, ggplot2 for the plot)
library(dplyr)
library(ggplot2)
# Obtain source data
load(url("http://klein.uk/R/Viz/popGH.RData"))
# Convert to summary table
df <- as_tibble(popGH) %>%
  mutate(AgeDecade = as.factor(floor(AGE / 10) * 10)) %>%
  group_by(SEX, AgeDecade) %>%
  dplyr::summarise(N = n(), .groups = "drop") %>%
  # A more transparent way of managing the transformation to get "Females to the left".
  mutate(PlotN = ifelse(SEX == "Female", -N, N))
# Create the plot
df %>% ggplot() +
  geom_col(aes(fill = SEX, x = AgeDecade, y = PlotN)) +
  scale_y_continuous(breaks = c(-2e5, 0, 2e5), labels = c("200k", "0", "200k")) +
  labs(y = "Population", x = "Age group") +
  scale_fill_discrete(name = "Sex") +
  coord_flip()
This gives a population pyramid with the female bars to the left of zero and the male bars to the right.
Note that I've created a new column to produce the "females to the left" effect in the plot. Normally, I'd avoid doing that and would rely on the options of the various ggplot functions to achieve the same thing (much as you have attempted to do). However, in this case, I think it's far more transparent (and simpler) to use the extra column than to modify the mapping.
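The same idea can be applied directly to raw data shaped like the table in the question (one row per person). The following is a minimal sketch, not the asker's actual data: the six rows of pyramid_graph are made up for illustration, and it assumes sex is coded as "Male"/"Female" as in the printed sample.

```r
library(dplyr)
library(ggplot2)

# Hypothetical raw data in the same shape as the question: one row per person.
pyramid_graph <- data.frame(
  sex       = c("Male", "Female", "Male", "Male", "Female", "Female"),
  age_group = c("20-30", "50-60", "70-80", "10-20", "80-90", "20-30")
)

# Count people per sex and age group, then negate the female counts
# so that their bars extend to the left of zero.
pyramid_counts <- pyramid_graph %>%
  count(sex, age_group, name = "N") %>%
  mutate(PlotN = ifelse(sex == "Female", -N, N))

pyramid_plot <- ggplot(pyramid_counts, aes(x = age_group, y = PlotN, fill = sex)) +
  geom_col() +
  scale_y_continuous(labels = abs) +  # show magnitudes instead of signed values
  labs(x = "Age group", y = "Population", fill = "Sex") +
  coord_flip()

print(pyramid_plot)
```

Counting first and plotting with geom_col keeps the sign flip visible in the data rather than hidden inside the aesthetic mapping.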

Related

R: how to filter within aes()

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response Gender n
1 1 84
1 2 79
2 1 42
2 2 74
3 1 84
3 2 79
etc.
I want to plot these in a column chart: on the y-axis I want n (or its proportion), and on the x-axis I want two separate bars for each response, one for gender 1 and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")
Maybe something like this:
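library(dplyr)         # provides %>%
library(ggplot2)
library(RColorBrewer)
# df is assumed to be the summary table from the question (Response, Gender, n)
df %>%
  ggplot(aes(x = factor(Response), y = n, fill = factor(Gender))) +
  geom_col(position = position_dodge()) +
  scale_fill_brewer(palette = "Set1") +
  theme_light()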
Your attempt does not work because you assign x and y as if they came from two different datasets (one filtered table for x and another for y). In line with the solution from TarJae, think of the aesthetics as the axes of a diagram: map the categorical variable you are comparing to the x-axis, and map the numerical variable that determines the height of the bars to the y-axis. Finally, to compare the groups by colour, map your grouping variable to a colour aesthetic so that each group gets its own colour (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "Identity", position = "Dodge")
I am adding "Identity" because the default in geom_bar is to count the occurrences in your data (which is what you want when the data are not yet aggregated). I am adding "Dodge" so that the bars are placed side by side instead of stacked. I recommend this resource for more information: https://r4ds.had.co.nz/index.html
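The question also mentions plotting proportions instead of raw n. A minimal sketch of that variant follows, using only the six rows printed in the question (the original table continues with "etc."); the names responses, responses_prop, and prop are chosen here for illustration, and "proportion" is assumed to mean the share of each response within its gender.

```r
library(dplyr)
library(ggplot2)

# The rows shown in the question (the real table has more responses).
responses <- data.frame(
  Response = c(1, 1, 2, 2, 3, 3),
  Gender   = c(1, 2, 1, 2, 1, 2),
  n        = c(84, 79, 42, 74, 84, 79)
)

# Share of each response within its gender, so each gender's bars sum to 1.
responses_prop <- responses %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

prop_plot <- ggplot(responses_prop,
                    aes(x = factor(Response), y = prop, fill = factor(Gender))) +
  geom_col(position = "dodge") +
  labs(x = "Response", y = "Proportion within gender", fill = "Gender")

print(prop_plot)
```

All filtering and arithmetic happens on the data frame before plotting; aes() only names the columns to map.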

How to make a continuous fill in a ggplot2 bar plot with one variable

I am using the movies dataset from the ggplot2movies package.
Please keep in mind that I refer to mpaa rating and user rating, which are two different things. In case you don't want to load the ggplot2movies library, here is a sample of the relevant data:
> head(subset(movies[,c(5,17)], movies$mpaa!=""))
# A tibble: 6 x 2
rating mpaa
<dbl> <chr>
1 5.3 R
2 7.1 PG-13
3 7.2 PG-13
4 4.9 R
5 4.8 PG-13
6 6.7 PG-13
Here I make a barplot that shows the frequency of films that have any mpaa rating:
ggplot(data=subset(movies, movies$mpaa!=""), aes(mpaa)) +
geom_bar()
Now I would like to color in the bars with a fill, based on the imdb user rating. I don't want to use factor(rating) because there are an enormous number of different values in the rating column. However, when I try to use a continuous fill like in Assigning continuous fill color to geom_bar I get the same graph.
ggplot(data=subset(movies, movies$mpaa!=""), aes(mpaa, fill=rating)) +
geom_bar()+
scale_fill_continuous(low="blue", high="red")
I figure it has to do with the fact that my barplot is based on the frequency of a single variable, rather than a dataframe with a count column. I could make a new dataframe of the mpaa categories and their counts, but I'd rather know how to do this graph with the original movies dataset and a single ggplot.
Edit: Using aes(mpaa, group = rating, fill = rating) gives a chart that is almost correct, except that the bars and legend are swapped.
You can reverse the legend with + guides(fill=guide_colourbar(reverse=TRUE)). However, a colour gradient doesn't seem very informative. Another option is to cut rating into discrete ranges, as in the example below, which gives a clearer indication of the distribution of ratings within each mpaa category. Nevertheless, because the bars have different heights, it's still not clear how the average rating or the distribution of ratings varies by mpaa category.
library(tidyverse)
library(ggplot2movies)
theme_set(theme_classic())
movies %>%
filter(mpaa != "") %>%
mutate(rating = fct_rev(cut(rating, seq(0,ceiling(max(rating)),2)))) %>%
ggplot(aes(mpaa, fill=rating)) +
geom_bar(colour="white", size=0.2) +
scale_fill_manual(values=c(hcl(240,100,c(30,70)), "yellow", hcl(0,100,c(70,30))))
Perhaps a boxplot or violin plot would be more informative. In the boxplot example below, the box widths are proportional to the square root of the number of movies rated, due to the varwidth=TRUE argument (I'm not that wild about this because the square-root transformation is difficult to interpret, but I thought I'd put it out there as an option). In the violin plot, the area of each violin is proportional to the number of movies in each mpaa category (due to the scale="count" argument). I've also put the number of movies in each category in the x-axis label, and marked in blue the mean rating for each mpaa category.
p = movies %>%
filter(mpaa != "") %>%
group_by(mpaa) %>%
mutate(xlab = paste0(mpaa, "\n(", format(n(), big.mark=","), ")")) %>%
ggplot(aes(xlab, rating)) +
labs(x="MPAA Rating\n(number of movies)",
y="Viewer Rating") +
scale_y_continuous(limits=c(0,10))
pl = list(geom_boxplot(varwidth=TRUE, colour="grey70"),
          geom_violin(colour="grey70", scale="count",
                      draw_quantiles=c(0.25,0.5,0.75)),
          # fun and after_stat() replace the deprecated fun.y and ..y.. (ggplot2 >= 3.3.0)
          stat_summary(fun=mean, geom="text", aes(label=sprintf("%1.1f", after_stat(y))),
                       colour="blue", size=3.5))
gridExtra::grid.arrange(p + pl[-2], p + pl[-1], ncol=2)
I am not sure that the following is what you want.
When colouring by rating, the default stat = "count" does not work, because a fill value that varies within a bar is dropped when the rows are counted. So I summarise the data first.
library(ggplot2movies)
library(dplyr)
data("movies")
subset(movies, mpaa != "") %>%
group_by(mpaa) %>%
summarise(rating = sum(rating)) %>%
ggplot(aes(x = mpaa, y = rating, fill = rating)) +
geom_bar(stat = "identity") +
scale_fill_continuous(low="blue", high="red")
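A variant of the summarise-first approach keeps the bar height as the number of films and uses the mean rating only for the fill. This is a sketch under assumptions: it uses just the six rows printed in the question rather than the full movies data, the names movies_small, n_films, and mean_rating are introduced here for illustration, and it swaps in scale_fill_gradient for the colour scale.

```r
library(dplyr)
library(ggplot2)

# The six rows shown by head() in the question, standing in for the full dataset.
movies_small <- data.frame(
  mpaa   = c("R", "PG-13", "PG-13", "R", "PG-13", "PG-13"),
  rating = c(5.3, 7.1, 7.2, 4.9, 4.8, 6.7)
)

# One row per mpaa category: number of films and their mean user rating.
mpaa_summary <- movies_small %>%
  group_by(mpaa) %>%
  summarise(n_films = n(), mean_rating = mean(rating), .groups = "drop")

# Bar height is the film count; fill is the (single) mean rating per bar.
mpaa_plot <- ggplot(mpaa_summary, aes(x = mpaa, y = n_films, fill = mean_rating)) +
  geom_col() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "MPAA rating", y = "Number of films", fill = "Mean user rating")

print(mpaa_plot)
```

Because each bar now has exactly one fill value, the continuous scale has something well defined to colour.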

Pie charts within a list

Let me start off by saying that I know pie charts are terrible methods of accurately displaying data but I have been asked to produce this as part of a report. I have a data set that contains information about location, injury type, and then several fields of personal data. I would like to display a pie chart of the percentage of each type of injury that occurs at each location. I've tried this where facility2 is a list of 52 elements created by splitting the full dataframe by ServiceSite.x. This partially works but the pie charts created only contain the count for one "initial type".
summarized_list <- lapply(facility2, function(x){
x %>% group_by(InitialType) %>% summarize(length(InitialType))
})
pies <- function(z) {
ggplot(z, aes(x = "", fill = length(InitialType)))+
geom_bar(width = 1, na.rm = TRUE)+
coord_polar(theta = "y")
}
lapply(summarized_list, pies)
This also partially works and would be perfect, but only prints out 13 charts instead of all 52
pies2 <- function(x) {
ggplot(x, aes(x = "", fill = InitialType))+
geom_bar(width = 1, na.rm = TRUE)+
coord_polar(theta = "y")+
xlab(x$ServiceSite.x)
}
lapply(facility2, pies2)
and gives this error
Error: Aesthetics must be either length 1 or the same as the data (1): x, fill
I know the first method splits the data perfectly while providing correct counts, I just can't figure out what I need to change in the ggplot() to have all injury types display for each facility. I would also like to add a label of percentages if possible or at least just the counts.
Sample data:
ServiceSite.x InitialType
2 Dermatitis
2 Diabetic
2 Pressure Injury
2 Pressure Injury
3 Pressure Injury
3 Other
3 Laceration
3 Other
4 Pressure Injury
4 MASD
4 Blister (Non-Pressure)
4 Skin Tear
4 Pressure Injury
5 Skin Tear
5 Other
5 Contusion
5 Skin Tear
5 Surgical(Non-Healing)
5 Pressure Injury
6 Pressure Injury
1 Pressure Injury
6 Pressure Injury
6 MASD
1 Surgical(Non-Healing)
1 Pressure Injury
1 Skin Tear
1 Contusion
facility2 <- split(full, full$ServiceSite.x)
Both variables are factors.
I'm not getting the error you're showing, so you might want to go through and find the culprit, which may not be in the sample you posted. I'm using purrr::map rather than lapply; that's partially preference, and partially how well it fits in with a piped workflow. I also find map functions easy for debugging, since it will print the name or index of each item as it's being mapped over; this often helps me figure out where in a list a problem is.
The first set of plots here just come from rewriting your code using purrr::imap, which maps over two lists: the list itself, and its names. split names the list based on the values in ServiceSite.x, so you now have access to them to set the xlab. I'm not sure if one bug might have been in setting the xlab with x$ServiceSite.x, which seems like it should have returned an entire vector, not just a single string.
library(tidyverse)
library(patchwork)
pies1 <- df %>%
split(.$ServiceSite.x) %>%
imap(function(data, site) {
ggplot(data, aes(x = "", fill = InitialType)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
xlab(site)
})
I'm using the patchwork library to stick all the plots together just for easier display here.
reduce(pies1, `+`) + plot_layout(ncol = 2, byrow = TRUE)
For the labels, do a little data prep first to calculate counts and percentages. Here I did this with a couple dplyr functions. Then add a geom_text with position_stack(vjust = 0.5) so the texts will 1. stack going around the circle, the same as the bars do, and 2. be centered in the wedges. I'll leave it up to you to format the text as you want, including adding count labels instead or in addition.
pies2 <- df %>%
split(.$ServiceSite.x) %>%
imap(function(data, site) {
data %>%
count(ServiceSite.x, InitialType) %>%
mutate(share = round(n / sum(n), digits = 2)) %>%
ggplot(aes(x = "", y = n, fill = InitialType)) +
geom_col(width = 1) +
geom_text(aes(label = scales::percent(share)), position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y")
})
pies2[[1]]
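If a single figure is acceptable instead of a list of plots, a different technique (faceting rather than splitting) can draw one pie per site. The sketch below is an assumption-laden alternative, not the method used above: it takes only the first eight rows of the sample data, plots shares rather than counts so that every pie spans the full circle, and uses the illustrative names injuries, shares, and pie_grid.

```r
library(dplyr)
library(ggplot2)

# First eight rows of the sample data from the question (sites 2 and 3).
injuries <- data.frame(
  ServiceSite.x = c(2, 2, 2, 2, 3, 3, 3, 3),
  InitialType   = c("Dermatitis", "Diabetic", "Pressure Injury", "Pressure Injury",
                    "Pressure Injury", "Other", "Laceration", "Other")
)

# Count injuries per site and type, then convert to a share within each site.
shares <- injuries %>%
  count(ServiceSite.x, InitialType) %>%
  group_by(ServiceSite.x) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# One stacked bar per site, bent into a pie by coord_polar and split by facet_wrap.
pie_grid <- ggplot(shares, aes(x = "", y = share, fill = InitialType)) +
  geom_col(width = 1) +
  geom_text(aes(label = scales::percent(share)),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  facet_wrap(~ ServiceSite.x)

print(pie_grid)
```

With 52 sites a single faceted figure may become crowded, in which case the list of separate plots remains the more practical choice.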

grouped barplot: order x-axis & keep constant bar width, in case of missing levels

Here is my script (example inspired from here and using the reorder option from here):
library(ggplot2)
Animals <- read.table(
header=TRUE, text='Category Reason Species
1 Decline Genuine 24
2 Improved Genuine 16
3 Improved Misclassified 85
4 Decline Misclassified 41
5 Decline Taxonomic 2
6 Improved Taxonomic 7
7 Decline Unclear 10
8 Improved Unclear 25
9 Improved Bla 10
10 Decline Hello 30')
fig <- ggplot(Animals, aes(x=reorder(Animals$Reason, -Animals$Species), y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge")
This gives the following output plot:
What I would like is to order my barplot only on condition 'Decline', and all the 'Improved' would not be inserted in the middle. Here is what I would like to get (after some svg editing):
Now the whole 'Decline' condition is sorted and the 'Improved' condition comes after it. Ideally, all bars would also have the same width, even when a condition has no value for a given reason (e.g. "Bla" has no "Decline" value).
Any idea on how I could do that without having to play with SVG editors? Many thanks!
First let's fill your data.frame with missing combinations like this.
library(dplyr)
Animals2 <- expand.grid(Category=unique(Animals$Category), Reason=unique(Animals$Reason)) %>% data.frame %>% left_join(Animals)
Then you can create an ordering variable for the x-scale:
myorder <- Animals2 %>% filter(Category=="Decline") %>% arrange(desc(Species)) %>% .$Reason %>% as.character
And then plot:
ggplot(Animals2, aes(x=Reason, y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge") + scale_x_discrete(limits=myorder)
Define a new data frame with all combinations of Category and Reason, merge it with the Species values from the data frame Animals, and replace the missing values with 0. Then set the x-axis order with scale_x_discrete:
Animals3 <- expand.grid(Category=unique(Animals$Category),Reason=unique(Animals$Reason))
Animals3 <- merge(Animals3,Animals,by=c("Category","Reason"),all.x=TRUE)
Animals3[is.na(Animals3)] <- 0
Animals3 <- Animals3[order(Animals3$Category,-Animals3$Species),]
ggplot(Animals3, aes(x=Reason, y=Species, fill = Category)) + geom_bar(stat="identity", position = "dodge") + scale_x_discrete(limits=as.character(Animals3[Animals3$Category=="Decline","Reason"]))
To achieve something like that I would adjust the data frame when working with ggplot. Add the missing categories with a value of zero.
Animals <- rbind(Animals,
data.frame(Category = c("Improved", "Decline"),
Reason = c("Hello", "Bla"),
Species = c(0,0)
)
)
Along the same lines as the answer from user Alex, a less manual way of adding the categories might be
d <- with(Animals, expand.grid(unique(Category), unique(Reason)))
names(d) <- names(Animals)[1:2]
Animals <- merge(d, Animals, all.x=TRUE)
Animals$Species[is.na(Animals$Species)] <- 0
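Another way to add the missing combinations, swapped in here as an alternative to expand.grid plus merge, is tidyr::complete. The sketch below rebuilds the data from the question and then derives the x-axis order from the 'Decline' rows; the names Animals_full and decline_order are introduced for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Data from the question.
Animals <- data.frame(
  Category = c("Decline", "Improved", "Improved", "Decline", "Decline",
               "Improved", "Decline", "Improved", "Improved", "Decline"),
  Reason   = c("Genuine", "Genuine", "Misclassified", "Misclassified", "Taxonomic",
               "Taxonomic", "Unclear", "Unclear", "Bla", "Hello"),
  Species  = c(24, 16, 85, 41, 2, 7, 10, 25, 10, 30),
  stringsAsFactors = FALSE
)

# Add every missing Category/Reason pair with Species = 0,
# so each reason has both bars and all bars keep the same width.
Animals_full <- complete(Animals, Category, Reason, fill = list(Species = 0))

# Order the reasons by their 'Decline' value, largest first.
decline_order <- Animals_full %>%
  filter(Category == "Decline") %>%
  arrange(desc(Species)) %>%
  pull(Reason)

bar_plot <- ggplot(Animals_full, aes(x = Reason, y = Species, fill = Category)) +
  geom_col(position = "dodge") +
  scale_x_discrete(limits = decline_order)

print(bar_plot)
```

Reasons with no 'Decline' value (here "Bla") end up last because their filled-in count is 0.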

ggplot: percentage counts line graph for factor groups on a scale

Say I want to plot percentages of "yes" answers to a question, across different age groups in ggplot. These age groups are obviously factors, but I want them to be shown in a scale-like fashion, so want to use a line graph.
Here's some data:
mydata <- data.frame(
  age_group = c("young", "middle", "old"),
  question = sample(c("yes", "no"), 99, replace = TRUE),
  stringsAsFactors = TRUE)  # needed in R >= 4.0 so that levels() below returns the factor levels
mydata$age_group = factor(mydata$age_group,levels(mydata$age_group)[c(3, 1, 2)])
mydata$question = factor(mydata$question,levels(mydata$question)[c(2,1)])
So far, I have been using this code to generate a stacked barplot:
ggplot(mydata, aes(age_group, fill = question)) + geom_bar(position = "fill")
How could I change this into a line graph, with just the frequency counts of the "yes" answers? Mark in the answers suggests a workaround which produces the right output:
But I was hoping there was a way to do this automatically in one line of code, rather than creating this summary table first.
If I understood correctly, this does what you want:
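# stat_count() counts rows for a discrete x; stat_bin() requires a continuous x in current ggplot2
ggplot(mydata) +
  stat_count(aes(x=age_group, color=question, group=question), geom="line")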
Note this doesn't look exactly the same as yours in terms of yes/no because you didn't set a seed for the random numbers.
If you just want the percentages of "yes" for each category, I suggest changing your data to the following:
question age_group value percent
1 yes young 14 0.4242424
3 yes middle 17 0.5151515
5 yes old 20 0.6060606
Using this code to summarize the data:
library(reshape)
mydata.summary = melt(xtabs(~question+age_group,data=mydata))
mydata.summary2 = mydata.summary[mydata.summary$question=="yes",]
mydata.summary2$percent <- mydata.summary2$value/melt(xtabs(~age_group,data=mydata))$value
ggplot(mydata.summary2, aes(age_group,percent, group = question, colour=question)) + geom_line()
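The share of "yes" answers can also be computed inside the plot call, without a separate summary table, by letting stat_summary take the mean of a 0/1 indicator. This is a sketch under assumptions: the seed is arbitrary, the helper column yes is introduced here, and the fun argument requires ggplot2 3.3.0 or later. The yes_share table is only a cross-check of what the plot computes.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed so the sampled answers are reproducible
mydata <- data.frame(
  age_group = factor(rep(c("young", "middle", "old"), times = 33),
                     levels = c("young", "middle", "old")),
  question  = sample(c("yes", "no"), 99, replace = TRUE)
)

# 1 for "yes", 0 for "no": the mean of this column is the share of "yes".
mydata$yes <- as.numeric(mydata$question == "yes")

# group = 1 joins the three age groups into a single line.
yes_plot <- ggplot(mydata, aes(x = age_group, y = yes, group = 1)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  labs(x = "Age group", y = "Share answering yes")

print(yes_plot)

# Cross-check: the same shares computed explicitly.
yes_share <- mydata %>%
  group_by(age_group) %>%
  summarise(percent = mean(yes), .groups = "drop")
```

Setting the factor levels explicitly fixes the left-to-right order of the age groups on the x-axis.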
