Can someone provide me some hints as to what I am doing wrong in my code? Or what I need to correct to get the correct percentages? I am trying to get the proportions by manipulating my ggplot2 code. I would prefer not mutating a column. However, if I can't get ggplot2 to give me the correct proportions, I will then be open to adding columns.
Here is the reproducible data:
cat_type<-c("1", "1","2","3","1","3", "3","2","1","1","1","3","3","2","3","2","3","1","3","3","3","1","3","1","3","1","1","3","1")
country<-c("India","India","India","India","India","India","India","India","India","India","Indonesia","Russia","Indonesia","Russia","Russia","Indonesia","Indonesia","Indonesia","Indonesia","Russia","Indonesia","Russia","Indonesia","Indonesia","Russia", "Russia", "India","India","India")
bigcats<-data.frame(cat_type=cat_type,country=country)
My data gives me the following proportions (these are correct):
> table(bigcats$cat_type, bigcats$country) ## raw numbers
    India Indonesia Russia
  1     7         3      2
  2     2         1      1
  3     4         5      4
>
> 100*round(prop.table(table(bigcats$cat_type, bigcats$country),2),3) ## proportions by column total
    India Indonesia Russia
  1  53.8      33.3   28.6
  2  15.4      11.1   14.3
  3  30.8      55.6   57.1
However, my ggplot2 is giving me the incorrect proportions:
bigcats %>% ggplot(aes(x=country, y = prop.table(stat(count)), fill=cat_type, label = scales::percent(prop.table(stat(count)))))+
geom_bar(position = position_fill())+
geom_text(stat = "count", position = position_fill(vjust=0.5),colour = "white", size = 5)+
labs(y="Percent",title="Top Big Cat Populations",x="Country")+
scale_fill_discrete(name=NULL,labels=c("Siberian/Bengal", "Other wild cats", "Puma/Leopard/Jaguar"))+
scale_y_continuous(labels = scales::percent)
The issue is that prop.table(stat(count)) computes each count as a share of the grand total across all bars, not as a share within each country. In other words, it is equivalent to:
library(dplyr)
bigcats %>%
count(cat_type, country) %>%
mutate(pct = scales::percent(prop.table(n)))
#>   cat_type   country n   pct
#> 1        1     India 7 24.1%
#> 2        1 Indonesia 3 10.3%
#> 3        1    Russia 2  6.9%
#> 4        2     India 2  6.9%
#> 5        2 Indonesia 1  3.4%
#> 6        2    Russia 1  3.4%
#> 7        3     India 4 13.8%
#> 8        3 Indonesia 5 17.2%
#> 9        3    Russia 4 13.8%
Making use of a helper function to reduce code duplication you could compute your desired proportions like so:
library(ggplot2)
prop <- function(count, group) {
  count / tapply(count, group, sum)[group]
}
ggplot(bigcats, aes(
  x = country, y = prop(after_stat(count), after_stat(x)),
  fill = cat_type, label = scales::percent(prop(after_stat(count), after_stat(x)))
)) +
  geom_bar(position = position_fill()) +
  geom_text(stat = "count", position = position_fill(vjust = 0.5), colour = "white", size = 5) +
  labs(y = "Percent", title = "Top Big Cat Populations", x = "Country") +
  scale_fill_discrete(name = NULL, labels = c("Siberian/Bengal", "Other wild cats", "Puma/Leopard/Jaguar")) +
  scale_y_continuous(labels = scales::percent)
Created on 2021-07-28 by the reprex package (v2.0.0)
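To see what the helper does outside of ggplot2, here is a minimal base-R sketch that applies the same idea to the raw counts from the table in the question. The vectors count and group are hand-typed stand-ins for what after_stat(count) and after_stat(x) would supply, so treat them as assumptions for illustration only.

```r
# Same helper as above: divide each count by the total of its group
prop <- function(count, group) {
  count / tapply(count, group, sum)[group]
}

# Counts per cat_type (1, 2, 3) within each country, copied from the table
count <- c(7, 2, 4, 3, 1, 5, 2, 1, 4)
group <- rep(c("India", "Indonesia", "Russia"), each = 3)

# Percentages within each country, rounded to one decimal place
pct <- round(100 * as.vector(prop(count, group)), 1)
pct
```

Each block of three values sums to (roughly) 100, which is the per-country behaviour that prop.table(count) alone does not give.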
Related
I'm having trouble getting a trend line to appear on a scatter plot of data structured the following way:
cohort count
<chr> <int>
1 1989 5
2 1990 7
3 1991 4
4 1992 4
5 1993 8
6 1994 7
This is the code I used to produce the plot:
ggplot(bipoc, aes(x = cohort, y = count)) +
geom_point() +
geom_smooth(method = 'lm') +
theme_classic()
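No answer to this question is included here. One likely cause, offered as an assumption rather than a confirmed diagnosis, is that cohort is stored as character (<chr>): ggplot2 then treats every year as its own group, so geom_smooth() has only one point per group to fit. A minimal sketch, with the six rows above retyped by hand, converts cohort to a number first:

```r
library(ggplot2)

# The six rows shown above, retyped; cohort is character as in the question
bipoc <- data.frame(
  cohort = c("1989", "1990", "1991", "1992", "1993", "1994"),
  count = c(5L, 7L, 4L, 4L, 8L, 7L),
  stringsAsFactors = FALSE
)

# Convert the year to a number so x is continuous and all points share one group
bipoc$cohort <- as.integer(bipoc$cohort)

p <- ggplot(bipoc, aes(x = cohort, y = count)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()
```

If the axis should stay discrete, adding group = 1 inside aes() is another option that should have the same effect on the trend line.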
I have a pretty basic df in which I have calculated the rank-change of values between two timestamps:
value rank_A rank_B group
1 A 1 1 A
2 B 2 3 A
3 C 3 2 B
4 D 4 4 B
5 E 5 8 A
6 F 6 5 C
7 G 7 6 C
8 H 8 7 A
What makes it a bit tricky (for me) is plotting the values on the Y-axis.
ggplot(df_alluvial, aes(y = value, axis1 = rank_A, axis2 = rank_B))+
geom_alluvium(aes(fill = group), width = 1/12)+
...
As of now, I can plot the rank-change and the groups successfully, but they are not linked to my value-names - there are no axis names and I don't know how to add them.
In the end it should look similar to this:
https://www.reddit.com/r/GraphicalExcellence/comments/4imh5f/alluvial_diagram_population_size_and_rank_of_uk/
Thanks for your advice!
Your update made the question more clear to me.
The y parameter should be a numerical value, and the data should be in 'long' format. I'm not sure how to change your data to fulfill these requirements. Therefore, I create some new data in this example. I have tried to make the data similar to the data in the plot that you have linked to.
Labels and stratum refer to the city-names. You can use geom_text to label the strata.
# Load libraries
library(tidyverse)
library(ggalluvial)
# Create some data
df_alluvial <- tibble(
city = rep(c("London", "Birmingham", "Manchester"), 4),
year = rep(c(1901, 1911, 1921, 1931), each = 3),
size = c(0, 10, 100, 10, 15, 100, 15, 20, 100, 30, 25, 100))
# Notice the data is in long-format
df_alluvial
#> # A tibble: 12 x 3
#> city year size
#> <chr> <dbl> <dbl>
#> 1 London 1901 0
#> 2 Birmingham 1901 10
#> 3 Manchester 1901 100
#> 4 London 1911 10
#> 5 Birmingham 1911 15
#> 6 Manchester 1911 100
#> 7 London 1921 15
#> 8 Birmingham 1921 20
#> 9 Manchester 1921 100
#> 10 London 1931 30
#> 11 Birmingham 1931 25
#> 12 Manchester 1931 100
ggplot(df_alluvial,
aes(x = as.factor(year), stratum = city, alluvium = city,
y = size,
fill = city, label = city))+
geom_stratum(alpha = .5)+
geom_alluvium()+
geom_text(stat = "stratum", size = 3)
If you want to sort the cities based on their size, you can add decreasing = TRUE to all layers in the plot.
ggplot(df_alluvial,
aes(x = as.factor(year), stratum = city, alluvium = city,
y = size,
fill = city, label = city))+
geom_stratum(alpha = .5, decreasing = TRUE)+
geom_alluvium(decreasing = TRUE)+
geom_text(stat = "stratum", size = 3, decreasing = TRUE)
Created on 2019-11-08 by the reprex package (v0.3.0)
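For the rank table in the question, one possible way to reach the 'long' format is sketched below in base R. The data frame df and the column names timestamp and rank are hypothetical names chosen for illustration; tidyr::pivot_longer() would be an alternative route to the same shape.

```r
# The rank table from the question, retyped by hand
df <- data.frame(
  value = LETTERS[1:8],
  rank_A = c(1, 2, 3, 4, 5, 6, 7, 8),
  rank_B = c(1, 3, 2, 4, 8, 5, 6, 7),
  group = c("A", "A", "B", "B", "A", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Stack the two rank columns: one row per value and timestamp
df_long <- data.frame(
  value = rep(df$value, times = 2),
  group = rep(df$group, times = 2),
  timestamp = rep(c("A", "B"), each = nrow(df)),
  rank = c(df$rank_A, df$rank_B),
  stringsAsFactors = FALSE
)
```

With the data in this shape, timestamp can play the role of year, and value the role of city, in the plots above; which numeric column to map to y is still a design decision this sketch does not settle.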
I want to chart the relative number of fatalities by year for each of various event types.
I can do this with facets in ggplot, but I am struggling to calculate the % by event from the event, year and number of fatalities.
Event Type   Year   Fatalities   % by Event (calculated)
----------   ----   ----------   -----------------------
Storm        1980    5           12.5%
Storm        1981    9           22.5%
Storm        1982   15           37.5%
Storm        1983   11           27.5%
Ice          1980    7           70%
Ice          1981    3           30%
I have the following code to calculate it, but the calculation is not working: the percentage appears to use a much higher denominator.
fatalitiesByYearType <- stormDF %>%
group_by(eventType) %>%
mutate(totalEventFatalities = sum(FATALITIES)) %>%
group_by(year, add = TRUE) %>%
mutate(fatalitiesPct = sum(FATALITIES) / totalEventFatalities)
What am I doing wrong?
My charting code is below. I include it because I'm also interested in whether there is a way of showing the data proportionally within ggplot itself.
p <- ggplot(data = fatalitiesByYearType,
aes(x=factor(year),y=fatalitiesPct))
p + geom_bar(stat="identity") +
facet_wrap(.~eventType, nrow = 5) +
labs(x = "Year",
y = "Fatalities",
title = "Fatalities by Type")
Maybe I do not get your problem, but we can start from here:
library(dplyr)
library(ggplot2)
# here the dplyr part
dats <- fatalitiesByYearType %>%
group_by(eventType) %>%
mutate(totalEventFatalities = sum(FATALITIES)) %>%
group_by(year, add = TRUE) %>%
# here we add the summarise
summarise(fatalitiesPct = sum(FATALITIES) / totalEventFatalities)
dats
# A tibble: 6 x 3
# Groups: eventType [?]
eventType year fatalitiesPct
<fct> <int> <dbl>
1 Ice 1980 0.7
2 Ice 1981 0.3
3 Storm 1980 0.125
4 Storm 1981 0.225
5 Storm 1982 0.375
6 Storm 1983 0.275
You could of course merge everything into a single dplyr/ggplot2 chain; here the plot is built separately:
# here the ggplot2 part
p <- ggplot(dats,aes(x=factor(year),y=fatalitiesPct)) +
geom_bar(stat="identity") +
facet_wrap(.~eventType, nrow = 5) +
labs(x = "Year", y = "Fatalities", title = "Fatalities by Type") +
# here we add the % in the plot
scale_y_continuous(labels = scales::percent)
With data:
fatalitiesByYearType <- read.table(text = "eventType year FATALITIES
Storm 1980 5
Storm 1981 9
Storm 1982 15
Storm 1983 11
Ice 1980 7
Ice 1981 3 ",header = T)
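If the raw data has several rows per event type and year, it may help to sum those rows first and only then divide by the event total. The following is a base-R sketch under that assumption; the duplicated Storm 1980 rows are made up here purely to show the aggregation step, and the object names totals and fatalitiesPct are illustrative.

```r
# Hypothetical raw data: Storm 1980 is split over two rows (2 + 3 = 5)
stormDF <- data.frame(
  eventType = c("Storm", "Storm", "Storm", "Storm", "Storm", "Ice", "Ice"),
  year = c(1980, 1980, 1981, 1982, 1983, 1980, 1981),
  FATALITIES = c(2, 3, 9, 15, 11, 7, 3)
)

# Step 1: one row per event type and year
totals <- aggregate(FATALITIES ~ eventType + year, data = stormDF, FUN = sum)

# Step 2: share of each year within its event type
totals$fatalitiesPct <- totals$FATALITIES /
  ave(totals$FATALITIES, totals$eventType, FUN = sum)

totals[order(totals$eventType, totals$year), ]
```

Because every event type and year now has exactly one row, geom_bar(stat = "identity") no longer stacks duplicate rows on top of each other.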
My question is similar to this one, except that my data are different. In my case, I was not able to use the solution given. I would expect points to show up on my map coloured according to the cut() values. Could someone point me in the right direction?
> test
# A tibble: 10 × 5
TC1 TC2 Lat Long Country
<dbl> <dbl> <dbl> <dbl> <fctr>
1 2.9 2678.0 50.62980 -95.60953 Canada
2 1775.7 5639.9 -31.81889 123.19389 Australia
3 4.4 5685.6 -10.10449 38.54364 Tanzania
4 7.9 NA 54.81822 -99.91685 Canada
5 11.2 2443.0 7.71667 -7.91667 Cote d'Ivoire
6 112.1 4233.4 -17.35093 128.02609 Australia
7 4.4 114.6 45.21361 -67.31583 Canada
8 8303.5 4499.9 46.63626 -81.39866 Canada
9 100334.8 2404.5 46.67291 -93.11937 USA
10 NA 1422.9 -17.32921 31.28224 Zimbabwe
ggplot(data = test, aes(x= Long, y= Lat)) +
borders("world", fill="gray75", colour="gray75", ylim = c(-60, 60)) +
geom_point(aes(size=TC2, col=cut(TC1, c(-Inf, 1000, 5000, 50000, Inf)))) +
# scale_colour_gradient(limits=c(100, 1000000), low="yellow", high="red") +
scale_color_manual(name = "TC1",
values = c("(-Inf,1000]" = "green",
"(1000,5000]" = "yellow",
"(5000,50000]" = "orange",
"(50000, Inf]" = "red"),
labels = c("up to 1", "1 to 5", "5 to 50", "greater than 50")) +
theme(legend.position = "right") +
coord_quickmap()
Warning message:
Removed 10 rows containing missing values (geom_point).
You were almost there! It's just the names of the 'cut' factors that are incorrect. If you try:
cut(test$TC1, c(-Inf, 1000, 5000, 50000, Inf))
# [1] (-Inf,1e+03] (1e+03,5e+03] (-Inf,1e+03] (-Inf,1e+03] (-Inf,1e+03]
# [6] (-Inf,1e+03] (-Inf,1e+03] (5e+03,5e+04] (5e+04, Inf] <NA>
# Levels: (-Inf,1e+03] (1e+03,5e+03] (5e+03,5e+04] (5e+04, Inf]
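As you can see, the level names differ slightly from the ones you typed (cut() formats the breaks in scientific notation here), so the names in values = c(...) never match. Using the actual level names works: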
library(ggplot2)
ggplot(data = test, aes(x = Long, y = Lat)) +
borders("world", fill="gray75", colour="gray75", ylim = c(-60, 60)) +
geom_point(aes(size=TC2, color = cut(TC1, c(-Inf, 1000, 5000, 50000, Inf)))) +
scale_color_manual(name = "TC1",
values = c("(-Inf,1e+03]" = "green",
"(1e+03,5e+03]" = "yellow",
"(5e+03,5e+04]" = "orange",
"(5e+04, Inf]" = "red"),
labels = c("up to 1", "1 to 5", "5 to 50", "greater than 50")) +
theme(legend.position = "right") +
coord_quickmap()
#> Warning: Removed 2 rows containing missing values (geom_point).
Data:
test <- read.table(text = 'TC1 TC2 Lat Long Country
1 2.9 2678.0 50.62980 -95.60953 Canada
2 1775.7 5639.9 -31.81889 123.19389 Australia
3 4.4 5685.6 -10.10449 38.54364 Tanzania
4 7.9 NA 54.81822 -99.91685 Canada
5 11.2 2443.0 7.71667 -7.91667 "Cote d\'Ivoire"
6 112.1 4233.4 -17.35093 128.02609 Australia
7 4.4 114.6 45.21361 -67.31583 Canada
8 8303.5 4499.9 46.63626 -81.39866 Canada
9 100334.8 2404.5 46.67291 -93.11937 USA
10 NA 1422.9 -17.32921 31.28224 Zimbabwe', header = T)
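An alternative that avoids depending on how cut() formats its breaks is to give the levels explicit labels and key the colours by those labels. This is a sketch only; tc1, tc1_class and colours are illustrative names, and the TC1 values are retyped from the data above.

```r
# TC1 values from the data above
tc1 <- c(2.9, 1775.7, 4.4, 7.9, 11.2, 112.1, 4.4, 8303.5, 100334.8, NA)
breaks <- c(-Inf, 1000, 5000, 50000, Inf)

# Explicit labels replace the auto-generated interval names
tc1_class <- cut(tc1, breaks,
                 labels = c("up to 1", "1 to 5", "5 to 50", "greater than 50"))

# Colours keyed by the same labels, suitable for scale_color_manual(values = colours)
colours <- c("up to 1" = "green", "1 to 5" = "yellow",
             "5 to 50" = "orange", "greater than 50" = "red")

table(tc1_class, useNA = "ifany")
```

In the plot, the colour aesthetic would then map to the labelled factor and the separate labels argument of scale_color_manual is no longer needed.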
I have a data set as below, and I have created a graph with the code below, as suggested in a previous question. What I want to do is order the bars by rankings rather than team names. Is that possible to do in ggplot?
Team Names PLRankingsReverse Grreserve
Liverpool 20 20
Chelsea 19 19
Manchester City 15 18
Arsenal 16 17
Tottenham 18 16
Manchester United 8 15
Everton 10 14
Watford 13 13
Burnley 17 12
Southampton 9 11
WBA 11 10
Stoke 4 9
Bournemouth 12 8
Leicester 7 7
Middlesbrough 14 6
C. Palace 6 5
West Ham 1 4
Hull 3 3
Swansea 5 2
Sunderland 2 1
And here is the code:
alldata <- read.csv("premierleague.csv")
library(ggplot2)
library(reshape2)
alldata <- melt(alldata)
ggplot(alldata, aes(x = Team.Names, y= value, fill = variable), xlab="Team Names") +
geom_bar(stat="identity", width=.5, position = "dodge")
Thanks for the help!
In this case you need to capture the desired team order from the data frame before melting it. You can then either use that order as the factor levels for the team name inside aes(), or pass it to the limits argument of scale_x_discrete. Note that read.csv() converts the "Team Names" header to Team.Names, which is the name used in your own code.
Using factor:
# Team names sorted by ranking (highest first), captured before melting
ordr <- as.character(alldata$Team.Names[order(alldata$PLRankingsReverse, decreasing = TRUE)])
alldata <- melt(alldata, id.vars = "Team.Names")
ggplot(alldata, aes(x = factor(Team.Names, levels = ordr), y = value, fill = variable)) +
  labs(x = "Team Name") +
  geom_bar(stat = "identity", width = .5, position = "dodge")
Using scale_x_discrete (starting again from the unmelted data):
ordr <- as.character(alldata$Team.Names[order(alldata$PLRankingsReverse, decreasing = TRUE)])
alldata <- melt(alldata, id.vars = "Team.Names")
ggplot(alldata, aes(x = Team.Names, y = value, fill = variable)) +
  labs(x = "Team Name") +
  geom_bar(stat = "identity", width = .5, position = "dodge") +
  scale_x_discrete(limits = ordr)
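The ordering step can be checked on its own before any plotting. The sketch below uses only the first four rows of the table in the question and assumes the column is called Team.Names, as it would be after read.csv(); ordr and team_factor are illustrative names.

```r
# First four rows of the table in the question
alldata <- data.frame(
  Team.Names = c("Liverpool", "Chelsea", "Manchester City", "Arsenal"),
  PLRankingsReverse = c(20, 19, 15, 16),
  Grreserve = c(20, 19, 18, 17),
  stringsAsFactors = FALSE
)

# Team names sorted by ranking, highest first
ordr <- alldata$Team.Names[order(alldata$PLRankingsReverse, decreasing = TRUE)]

# A factor with these levels is drawn in this order on a discrete x axis
team_factor <- factor(alldata$Team.Names, levels = ordr)
levels(team_factor)
```

Arsenal (16) comes before Manchester City (15) here, which is the ranking order rather than the alphabetical default.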