How to make stacked bar chart using ggplot2? [duplicate] - r

So I am having trouble making a stacked bar chart showing proportion of cases vs deaths.
This is the data:
df <- structure(list(Date = structure(c(19108, 19108, 19108, 19108,
19108, 19108, 19108, 19108, 19108, 19108), class = "Date"), Country = c("US",
"India", "Brazil", "France", "Germany", "United Kingdom", "Russia",
"Korea, South", "Italy", "Turkey"), Confirmed = c(81100599L,
43065496L, 30378061L, 28605614L, 24337394L, 22168390L, 17887152L,
17086626L, 16191323L, 15023662L), Recovered = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), Deaths = c(991940L, 523654L, 663108L,
146464L, 134489L, 174778L, 367692L, 22466L, 162927L, 98720L),
Active = c(80108659L, 42541842L, 29714953L, 28459150L, 24202905L,
21993612L, 17519460L, 17064160L, 16028396L, 14924942L)), row.names = c(163539L,
163431L, 163375L, 163414L, 163418L, 163537L, 163496L, 163444L,
163437L, 163533L), class = "data.frame")
and I want to generate something that looks like this except with proportions of deaths vs cases.

This is a modification of @Allan Cameron's answer that adds percent labels and takes a few different approaches:
library(tidyverse)
library(scales)
df %>%
rename_with(., ~str_replace_all(., 'top10.', '')) %>%
pivot_longer(
cols = -Country,
names_to = "Status",
values_to = "value",
values_transform = list(value = as.integer)
) %>%
mutate(Status = fct_rev(fct_infreq(Status))) %>%
group_by(Country) %>%
mutate(pct= prop.table(value) * 100) %>%
ggplot(aes(x= Country, y = pct, fill=Status)) +
geom_col(position = position_fill())+
scale_fill_manual(values = c("#ff34b3", "#4976ff")) +
scale_y_continuous(labels = scales::percent)+
ylab("Percentage") +
geom_text(aes(label=paste0(sprintf("%1.1f", pct),"%")),
position=position_fill(vjust = 0.1)) +
ggtitle("Your Title")
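Before plotting, it can help to inspect the reshaping and per-country percentage step on its own. The following is a minimal sketch, assuming a two-row subset of the question's data; the names small_df and pct_df are hypothetical and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row subset of the question's data
small_df <- data.frame(
  Country   = c("US", "India"),
  Confirmed = c(81100599L, 43065496L),
  Deaths    = c(991940L, 523654L)
)

pct_df <- small_df %>%
  # One row per Country/Status pair
  pivot_longer(cols = -Country, names_to = "Status", values_to = "value") %>%
  group_by(Country) %>%
  # Share of each status within its country, in percent
  mutate(pct = prop.table(value) * 100) %>%
  ungroup()

pct_df
```

Note that in this sketch the Deaths share is Deaths / (Confirmed + Deaths). Subtracting deaths from confirmed cases first, as the other answer does, gives Deaths / Confirmed instead.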

I had to use OCR to convert the image of your data into actual data I could use. It's far better to include your data as text for this reason.
The plot is not particularly informative because the percentages are low, and difficult to read, but in any case, you can do it like this:
library(tidyverse)
p <- df %>%
mutate(top10.Confirmed = top10.Confirmed - top10.Deaths,
top10.Country = factor(top10.Country, top10.Country)) %>%
rename(Country = top10.Country,
Survived = top10.Confirmed,
Died = top10.Deaths) %>%
pivot_longer(-Country, names_to = "Outcome", values_to = "Count") %>%
mutate(Outcome = factor(Outcome, c("Survived", "Died"))) %>%
ggplot(aes(Country, Count, fill = Outcome)) +
geom_col(position = "fill") +
scale_fill_manual(values = c("#4976ff", "#ff34b3")) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Covid outcomes by country", y = "Percent")
p
To make it easier to read, you could zoom into the bottom:
p + coord_cartesian(ylim = c(0, 0.05))
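To see why an upper limit of 0.05 is enough for the zoom, the death share per country can be computed directly. This is a small sketch using the figures from the data; the vector names confirmed, deaths and death_share are hypothetical.

```r
# Figures copied from the data in this answer
confirmed <- c(US = 81100599, India = 43065496, Brazil = 30378061,
               France = 28605614, Germany = 24337394,
               `United Kingdom` = 22168390, Russia = 17887152,
               `Korea, South` = 17086626, Italy = 16191323,
               Turkey = 15023662)
deaths <- c(US = 991940, India = 523654, Brazil = 663108,
            France = 146464, Germany = 134489,
            `United Kingdom` = 174778, Russia = 367692,
            `Korea, South` = 22466, Italy = 162927, Turkey = 98720)

# Deaths as a fraction of confirmed cases
death_share <- deaths / confirmed

# Largest share and the country it belongs to
max(death_share)
names(which.max(death_share))
```

coord_cartesian() zooms the view without discarding data, so the bars are still drawn; setting limits on the y scale instead would remove any bar that extends beyond them.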
Data in reproducible format
df <- structure(list(top10.Country = c("US", "India", "Brazil", "France",
"Germany", "United Kingdom", "Russia", "Korea, South", "Italy",
"Turkey"), top10.Confirmed = c(81100599L, 43065496L, 30378061L,
28605614L, 24337394L, 22168390L, 17887152L, 17086626L, 16191323L,
15023662L), top10.Deaths = c(991940L, 523654L, 663108L, 146464L,
134489L, 174778L, 367692L, 22466L, 162927L, 98720L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
df
#> top10.Country top10.Confirmed top10.Deaths
#> 1 US 81100599 991940
#> 2 India 43065496 523654
#> 3 Brazil 30378061 663108
#> 4 France 28605614 146464
#> 5 Germany 24337394 134489
#> 6 United Kingdom 22168390 174778
#> 7 Russia 17887152 367692
#> 8 Korea, South 17086626 22466
#> 9 Italy 16191323 162927
#> 10 Turkey 15023662 98720
Created on 2022-05-01 by the reprex package (v2.0.1)

Related

ordering of ggplot not working with factors

I have created a vector with the desired order for the dot plot, but the plot doesn't follow that order. Thanks for any suggestions.
order <- sav %>%
filter(Subject == "Food") %>%
arrange(desc(Percentage)) %>%
select(Location) %>%
unlist() %>%
unname()
order <- replace(order, c(1, 8), order[c(8, 1)])
sav %>%
ggplot(aes(x = factor(Location, levels = order), y = Percentage,
color = Subject))+
geom_point(data = filter(sav, Location != "IRELAND"),
size = 4, position = position_dodge(0.5))+
geom_point(data = filter(sav, Location == "IRELAND"),
size = 6, position = position_dodge(1))+
geom_linerange(data = filter(sav, Location == "IRELAND"),
aes(ymin = 0, ymax = Percentage),
position = position_dodge(1),
linetype = "dotdash") +
geom_linerange(data = filter(sav, Location != "IRELAND"),
aes(ymin = 0, ymax = Percentage),
position = position_dodge(0.5), linetype = "dotdash") +
coord_flip()+
ggtitle(label = "Increase in inflation (by CPI) in Ireland compared to OECD and other countries in OECD")+
xlab("Countries --> ") +
ylab("Increase in CPI by % -->")+
scale_y_continuous(breaks = round(seq(0, 20, by = 1),1))+
scale_color_manual(name = "Type of Items",
labels = c("Food", "Total", "Excluding food and energy"),
values=c(unname(colorblind_colors[2]),
unname(colorblind_colors[3]),
unname(colorblind_colors[4])))+
theme(panel.grid.major.x = element_line(linewidth =.01, color="black"),
panel.grid.major.y = element_blank(),
legend.position = "top"
)
> dput(order)
c("IRELAND", "NETHERLANDS", "SPAIN", "OECD", "ITALY", "FRANCE",
"UNITED STATES", "GERMANY", "CANADA")
> dput(sav)
structure(list(Location = c("CANADA", "CANADA", "FRANCE", "FRANCE",
"GERMANY", "GERMANY", "IRELAND", "IRELAND", "ITALY", "ITALY",
"NETHERLANDS", "NETHERLANDS", "SPAIN", "SPAIN", "UNITED STATES",
"UNITED STATES", "OECD", "OECD", "CANADA", "ITALY", "SPAIN",
"FRANCE", "IRELAND", "UNITED STATES", "NETHERLANDS", "OECD",
"GERMANY"), Subject = c("Food", "Total", "Food", "Total", "Food",
"Total", "Food", "Total", "Food", "Total", "Food", "Total", "Food",
"Total", "Food", "Total", "Food", "Total", "Total_Minus_Food_Energy",
"Total_Minus_Food_Energy", "Total_Minus_Food_Energy", "Total_Minus_Food_Energy",
"Total_Minus_Food_Energy", "Total_Minus_Food_Energy", "Total_Minus_Food_Energy",
"Total_Minus_Food_Energy", "Total_Minus_Food_Energy"), Frequency = c("Monthly",
"Monthly", "Monthly", "Monthly", "Monthly", "Monthly", "Monthly",
"Monthly", "Monthly", "Monthly", "Monthly", "Monthly", "Monthly",
"Monthly", "Monthly", "Monthly", "Monthly", "Monthly", "Monthly",
"Monthly", "Monthly", "Monthly", "Monthly", "Monthly", "Monthly",
"Monthly", "Monthly"), Time = c("2022-12", "2022-12", "2022-12",
"2022-12", "2022-12", "2022-12", "2022-12", "2022-12", "2022-12",
"2022-12", "2022-12", "2022-12", "2022-12", "2022-12", "2022-12",
"2022-12", "2022-12", "2022-12", "2022-12", "2022-12", "2022-12",
"2022-12", "2022-12", "2022-12", "2022-12", "2022-12", "2022-12"
), Percentage = c(11.02015, 6.319445, 12.86678, 5.850718, 19.75631,
8.550855, 11.74636, 8.224299, 13.14815, 11.63227, 16.7983, 9.586879,
15.68565, 5.70769, 11.88275, 6.454401, 15.60381, 9.438622, 5.58275,
4.469475, 4.442303, 3.36004, 4.999758, 5.707835, 6.034873, 7.221961,
5.05511)), class = "data.frame", row.names = c(NA, -27L))
A couple of things:
The factors need to be in the data that ggplot sees at every geom, but while you're setting factor(Location,levels=order) in the first mapping, none of the data= arguments is using the same data.
For this, I generally prefer factorizing the data up-front, and using ~-style "functions" for data=.
I'm not sure exactly why, but that alone still doesn't work. However, if the first call to geom_point uses mutate to replace IRELAND's percentage values with NA (instead of filtering those rows out), the levels are retained. Weird.
### I still don't have these :-)
colorblind_colors <- 1:4
sav %>%
mutate(Location = factor(Location, levels = order)) %>%
ggplot(aes(x = Location, y = Percentage, color = Subject)) +
geom_point(data = ~ mutate(., Percentage = if_else(Location == "IRELAND", NA_real_, Percentage)),
size = 4, position = position_dodge(0.5), na.rm = TRUE) +
geom_point(data = ~ filter(., Location == "IRELAND"),
size = 6, position = position_dodge(1)) +
geom_linerange(data = ~ filter(., Location == "IRELAND"),
aes(ymin = 0, ymax = Percentage),
position = position_dodge(1),
linetype = "dotdash") +
geom_linerange(data = ~ filter(., Location != "IRELAND"),
aes(ymin = 0, ymax = Percentage),
position = position_dodge(0.5), linetype = "dotdash") +
coord_flip()+
ggtitle(label = "Increase in inflation (by CPI) in Ireland compared to OECD and other countries in OECD")+
xlab("Countries --> ") +
ylab("Increase in CPI by % -->")+
scale_y_continuous(breaks = round(seq(0, 20, by = 1),1))+
scale_color_manual(name = "Type of Items",
labels = c("Food", "Total", "Excluding food and energy"),
values=c(unname(colorblind_colors[2]),
unname(colorblind_colors[3]),
unname(colorblind_colors[4])))+
theme(panel.grid.major.x = element_line(linewidth =.01, color="black"),
panel.grid.major.y = element_blank(),
legend.position = "top"
)
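The two data-preparation ideas used above (factorizing up-front, and masking values with NA rather than filtering rows out) can be checked outside of ggplot. A minimal sketch, assuming three rows taken from the question's data with rounded values; the names sav_small, ord, no_ireland and masked are hypothetical.

```r
library(dplyr)

# Three "Food" rows from the question's data, values rounded
sav_small <- data.frame(
  Location   = c("CANADA", "IRELAND", "SPAIN"),
  Percentage = c(11.02, 11.75, 15.69)
)
ord <- c("IRELAND", "SPAIN", "CANADA")  # hypothetical display order

# Factorize once, before the data reaches ggplot
sav_small <- mutate(sav_small, Location = factor(Location, levels = ord))

# filter() drops rows but keeps every factor level
no_ireland <- filter(sav_small, Location != "IRELAND")
levels(no_ireland$Location)

# Replacing values with NA keeps the rows as well as the levels
masked <- mutate(sav_small,
                 Percentage = if_else(Location == "IRELAND", NA_real_, Percentage))
masked
```

This only shows what each layer's data looks like; it does not explain why the filtered version loses the ordering inside ggplot.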

How can I plot the values on the map?

The following code works for the specific dataset that is "world":
ggplot(data = world) +
geom_sf(aes(fill = pop_est)) +
scale_fill_viridis_c(option = "plasma", trans = "sqrt")
I would like to replace "world" with my own dataset so that I can see the SupDem level for each country on the map.
My data set:
structure(list(Country = c("Albania", "Albania", "Albania", "Albania",
"Albania", "Albania"), Year = 1998:2003, SupDem = c(0.956826521693282,
0.936230742033029, 0.903573815990819, 0.876945257628013, 0.856216104588584,
0.807742885231119), Supdem_u95 = c(1.90310875913895, 1.85879856969654,
1.78495639758744, 1.65257180642367, 1.56308076745783, 1.51389360690687),
Supdem_l95 = c(0.034448436601662, 0.0443012513174743, 0.0257741517619924,
0.0691486153748455, 0.187039084923293, 0.0595276656884577),
Supdem_sd = c(0.472026289177333, 0.464124184907683, 0.441013943388078,
0.402542004940216, 0.348425295168507, 0.377196905776233),
ISO3c = c("ALB", "ALB", "ALB", "ALB", "ALB", "ALB"),
v2x_regime = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The OP clarified in chat that they want to use the location data from the rnaturalearth package and plot the SupDem value for each country for the year 2020.
library(dplyr)
plotdata <- dplyr::inner_join(world, filter(myData, Year == 2020), by = c("admin" = "Country"))
ggplot(data = plotdata) +
geom_sf(aes(fill = SupDem)) +
scale_fill_viridis_c(option = "plasma", trans = "sqrt")
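Because inner_join drops rows without a match, countries whose names are spelled differently in the two tables would silently disappear from the map. One way to check for this is an anti_join, sketched below with plain data frames standing in for the real objects; world_stub, my_data and the SupDem values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the "admin" column of the map data
world_stub <- data.frame(
  admin = c("Albania", "France", "United States of America")
)

# Hypothetical stand-in for the user's data (values are made up)
my_data <- data.frame(
  Country = c("Albania", "France", "United States"),
  Year    = 2020L,
  SupDem  = c(0.8, 1.2, 1.0)
)

# Rows of my_data that have no matching country name in the map data
unmatched <- anti_join(filter(my_data, Year == 2020), world_stub,
                       by = c("Country" = "admin"))
unmatched$Country
```

Any names listed here would need to be recoded, or the join done on a shared code column, before they can appear on the map.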

R studio Barplot

I have the following dataframe:
structure(list(share.beer = c(0.277, 0.1376, 0.1194, 0.0769,
0.0539, 0.0361, 0.0361, 0.0351, 0.0313, 0.03, 0.0119, 0.0084,
0.007, 0.0069), country = c("Brazil", "China, mainland", "United States",
"Thailand", "Vietnam", "China, mainland", "China, mainland",
"China, mainland", "China, mainland", "Argentina", "Indonesia",
"China, mainland", "China, mainland", "India"), Beer = c("soyb",
"maiz", "soyb", "cass", "cass", "whea", "rape", "soyb", "rice",
"soyb", "cass", "cott", "swpo", "rape")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -14L))
I want to create a barplot so that the beer type appears in the legend, the countries as y values while the share.beer are my values to be filled.
I have tried various approaches, including the following code, but I can't get the result I want. Here, for instance, I kept the variable "Beer":
df %>%
pivot_longer(cols = -Country, values_to = "Count", names_to = "Type") %>%
ggplot() +
geom_col(aes(x = reorder(Country, -Count), y = Count, fill = Beer))
However, I get an error:
Can't combine share.beer and Beer.
Any help?
You actually don't need the pivot_longer to create a suitable dataframe. You can use the following code:
library(tidyverse)
df %>%
ggplot() +
geom_col(aes(x = reorder(country, -share.beer), y = share.beer, fill = Beer)) +
xlab("Country") +
ylab("Share beer") +
coord_flip()
Output:
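As for why the original attempt failed: pivot_longer tries to stack share.beer (numeric) and Beer (character) into a single value column, and the two types cannot be combined. A small sketch that reproduces this on two rows of the data; the names beer and result are hypothetical.

```r
library(tidyr)

# Two rows from the question's data
beer <- data.frame(
  share.beer = c(0.277, 0.1376),
  country    = c("Brazil", "China, mainland"),
  Beer       = c("soyb", "maiz")
)

# Capture the error message instead of stopping the script
result <- tryCatch(
  pivot_longer(beer, cols = -country, values_to = "Count", names_to = "Type"),
  error = function(e) conditionMessage(e)
)
result
```

The data is already in long form (one row per country and crop), which is why the columns can be mapped to the aesthetics directly.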

I am trying to add a smooth trend line using linear regression, Help me i have time series data

ggplot()+
geom_line(data=combined,aes(x=reorder(dates,value),y=value,color=variable,group=variable))+
scale_y_continuous(labels=function(x) format(x,scientific=FALSE))+
theme_gray()+
theme(axis.text.x = element_text(angle=90),
plot.title=element_text(hjust=0.5),plot.subtitle =
element_text(hjust = 0.5))+
annotate(geom = 'text',x='03.5.20',y=150000,label=
'WHO declares Covid-19 a Pandemic')+
annotate(geom = 'point',x='03.11.20',y=125865,size=6,shape=21,fill='blue')+
labs(title='Cases in China vs World ',x='Daily trend from January to March',y='Case Numbers ',
subtitle = 'Data From Jan 22,2020 - Mar 23,2020')
This is my regular plot, which works. Now I am trying to add a smooth trend line using linear regression. I tried using stat_smooth(method='lm', formula=?), but the example I am working with uses y ~ x.
My problem is that I have dates on my x axis, and I am not sure where to go from here.
This is the data i am using
dates variable value
1 01.22.20 World 555
2 01.23.20 World 653
3 01.24.20 World 941
4 01.25.20 World 1434
5 01.26.20 World 2118
6 01.27.20 World 2927
7 01.28.20 World 5578
8 01.29.20 World 6166
9 01.30.20 World 8234
10 01.31.20 World 9927
63 01.22.20 China 548
64 01.23.20 China 643
65 01.24.20 China 920
66 01.25.20 China 1406
67 01.26.20 China 2075
68 01.27.20 China 2877
69 01.28.20 China 5509
70 01.29.20 China 6087
71 01.30.20 China 8141
72 01.31.20 China 9802
Any tips on how to approach this would be appreciated.
You can use geom_smooth. If you do not want the smooth to be split by color, set the color and group aesthetics only in geom_line.
Data
df <- structure(list(dates = structure(c(18283, 18284, 18285, 18286,
18287, 18288, 18289, 18290, 18291, 18292, 18283, 18284, 18285,
18286, 18287, 18288, 18289, 18290, 18291, 18292), class = "Date"),
variable = c("World", "World", "World", "World", "World",
"World", "World", "World", "World", "World", "China", "China",
"China", "China", "China", "China", "China", "China", "China",
"China"), value = c(555L, 653L, 941L, 1434L, 2118L, 2927L,
5578L, 6166L, 8234L, 9927L, 548L, 643L, 920L, 1406L, 2075L,
2877L, 5509L, 6087L, 8141L, 9802L)), class = "data.frame", row.names = c(NA,
-20L))
Code
df %>%
ggplot(aes(x=reorder(dates,value),y=value,color=variable,group=variable))+
geom_line()+
geom_smooth(method = "lm", se = FALSE)
General smooth
df %>%
ggplot(aes(x=reorder(dates,value),y=value))+
geom_line(aes(color=variable,group=variable))+
geom_smooth(aes(group = 1),method = "lm", se = FALSE)
To keep the date labels from overlapping, convert the dates column to the Date class and map it to x directly:
df <- df %>%
mutate(dates = as.Date(dates, format = "%m.%d.%y"))
df %>%
ggplot(aes(x = dates, y = value, group = variable, color = variable)) +
geom_line() +
geom_smooth(data = subset(df, variable == "China"), aes( dates, value), method = "lm", se = FALSE) +
scale_y_continuous(labels=function(x) format(x,scientific=FALSE))+
theme_gray()+
theme(plot.title=element_text(hjust=0.5),plot.subtitle =
element_text(hjust = 0.5)) +
annotate(geom = 'text',x= as.Date('03.5.20', "%m.%d.%y"),y=150000,label=
'WHO declares Covid-19 a Pandemic')+
annotate(geom = 'point',x=as.Date('03.11.20', "%m.%d.%y"),y=125865,size=6,shape=21,fill='blue')+
labs(title='Cases in China vs World ',x='Daily trend from January to March',y='Case Numbers ',
subtitle = 'Data From Jan 22,2020 - Mar 23,2020')
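A linear fit on dates works because a Date is stored as the number of days since 1970-01-01, so it can serve as a numeric predictor. The sketch below fits the same kind of line outside ggplot, using the first five World rows from the question; the names day_num and fit are hypothetical.

```r
# First five "World" rows from the question's data
dates <- as.Date(c("01.22.20", "01.23.20", "01.24.20", "01.25.20", "01.26.20"),
                 format = "%m.%d.%y")
value <- c(555, 653, 941, 1434, 2118)

# Dates as day counts since 1970-01-01
day_num <- as.numeric(dates)

# Ordinary least squares line, the same model family as method = "lm"
fit <- lm(value ~ day_num)

# Slope: average change in cases per day over these rows
coef(fit)[["day_num"]]
```

With only a few rows of fast-growing counts, a straight line is a rough summary at best, so this is meant to illustrate the mechanics rather than to suggest a good model.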

Summarizing a specific column with dplyr

For my assignment I need to create an object which contains, for each
combination of Sex and Season, the number of different sports in the olympics data set. The columns of this object should be called Competitor_Sex, Olympic_Season, and Num_Sports,
respectively.
This is what I have at the moment:
object <- olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = ???)
I'm having trouble with defining the third column, which is the number of sports. My data looks like this:
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun"
), Sex = c("M", "M", "F", "M", "M"), Age = c(23L, 28L, 22L, 30L,
23L), Height = c(170L, 184L, 170L, 187L, 167L), Weight = c(60,
85, 125, 76, 64), Team = c("China", "Finland", "Romania", "France",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP"), Games = c("2012 Summer",
"2014 Winter", "2016 Summer", "2012 Summer", "2016 Summer"),
Year = c(2012L, 2014L, 2016L, 2012L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro"), Sport = c("Judo",
"Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
Event = c("Judo Men's Extra-Lightweight", "Ice Hockey Men's Ice Hockey",
"Weightlifting Women's Super-Heavyweight", "Athletics Men's 1,500 metres",
"Gymnastics Men's Individual All-Around"), Medal = c(NA,
"Bronze", NA, NA, NA)), row.names = c("1", "2", "3", "4",
"5"), class = "data.frame")
This is probably solved in an easy way. Could someone help me? Would be appreciated a lot!
Best Regards,
Grouping twice should work:
olympics %>%
group_by(Sex, Season, Sport) %>%
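summarise(n = n(), .groups = "drop") %>% # one row per Sex/Season/Sport
group_by(Sex, Season) %>%
summarise(Num_Sports = n()) # count the sports in each Sex/Season group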
You can use n_distinct(), dplyr's equivalent of length(unique(x)):
olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = n_distinct(Sport)) %>%
rename(Competitor_Sex = Sex, Olympic_Season = Season) # To rename the columns
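Both approaches should give the same counts. A minimal check, assuming the five rows from the question plus one hypothetical extra row (a second male Summer Judo entry) so that the distinct count differs from the row count; the names olympics_small, by_distinct and by_count are hypothetical.

```r
library(dplyr)

# Five rows from the question plus one hypothetical duplicate sport (last row)
olympics_small <- data.frame(
  Sex    = c("M", "M", "F", "M", "M", "M"),
  Season = c("Summer", "Winter", "Summer", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
             "Gymnastics", "Judo")
)

# Approach 1: count distinct sports within each group
by_distinct <- olympics_small %>%
  group_by(Sex, Season) %>%
  summarise(Num_Sports = n_distinct(Sport), .groups = "drop") %>%
  rename(Competitor_Sex = Sex, Olympic_Season = Season)

# Approach 2: reduce to unique Sex/Season/Sport rows, then count rows
by_count <- olympics_small %>%
  distinct(Sex, Season, Sport) %>%
  count(Sex, Season, name = "Num_Sports")

by_distinct
```

Setting .groups = "drop" returns an ungrouped result and avoids the grouping message that summarise() otherwise prints.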
