Cumulative stacked bar plot with the same variable with ggplot2 - r

I have a data frame producers with two columns: person_id and year.
# A tibble: 3,207 x 2
person_id year
<chr> <chr>
1 GASH1991-04-30 2020
2 LOSP1969-06-29 2020
3 CRGM1989-08-26 2020
4 CEVE1954-07-15 2020
5 HERR1998-01-06 2020
6 TOLR1951-04-09 2020
7 BEAM1953-09-07 2020
8 ANRJ1977-07-06 2020
9 PAMH1982-02-06 2020
10 AKTE1967-11-15 2020
# ... with 3,197 more rows
I can summarise this dataframe to obtain cumulative sum:
producers %>%
select(person_id, year) %>%
group_by(year) %>%
distinct(person_id) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(cum = cumsum(total))
# A tibble: 3 x 3
year total cum
<chr> <int> <int>
1 2019 456 456
2 2020 1832 2288
3 2021 160 2448
And I can make a cumulative bar plot like this:
ggplot(producers, aes(x = as.factor(year), y = as.integer(cum))) +
geom_bar(position = "stack", stat = "identity") +
ylim(0,3000) +
xlab("Year") +
ylab("Producers") +
theme_classic()
But what I really want is something like this:
I've been trying with aes(fill = year) and other arguments but I can't get it. Thanks for your responses.

Here's an approach. Ultimately, we'll need two "year" variables, one to mark the category within each stack, and one to mark which stack we want it to appear in. Here, I set up year2 for the 2nd one, and filter out the values that shouldn't appear yet in each stack.
df2 <- data.frame(
year = 2019:2021,
total = c(456, 1832, 160)
)
library(tidyverse)
df2 %>%
crossing(year2 = df2$year) %>% # make copy for each year
filter(year <= year2) %>% # keep just the years up to current year
ggplot(aes(year2, total, fill = fct_rev(as.factor(year)))) +
geom_col() +
scale_fill_discrete(name = "Year")
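As a quick sanity check on the approach above, the sketch below rebuilds the expanded table and confirms that the height of each stack equals the running total. The names expanded and stack_heights are only illustrative and do not appear in the original answer.

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(
  year = 2019:2021,
  total = c(456, 1832, 160)
)

# One copy of every row per target stack, keeping only years up to that stack
expanded <- df2 %>%
  crossing(year2 = df2$year) %>%
  filter(year <= year2)

# Height of each stack = sum of the segments that survive the filter
stack_heights <- expanded %>%
  group_by(year2) %>%
  summarise(height = sum(total), .groups = "drop")

stack_heights
```

If the filter is correct, stack_heights$height should match cumsum(df2$total).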

ggplot2 works best with data in a long format, where you have one variable to plot and various identifying variables to control the fill, color, and faceting. Here I explicitly build a repeated data frame using map_dfr, which essentially runs a loop over each year in the input dataset. In dat_long the new column yearid becomes the x-axis identifier, so within 2021 we can access the data for years 2019 through 2021 to control the color fill.
library(ggplot2)
library(dplyr)
library(purrr)
library(forcats)
year = c(2019, 2020, 2021)
sum = c(456, 1832, 160)
cumsum = c(456, 2288, 2448)
dat <- data.frame(year, sum)
# note: don't need the cumsum column
# instead, create long, replicated data where we repeat
# each year's entry for every year that comes after it
dat_long <-
map_dfr(unique(dat$year),
~filter(dat, year <= .x) %>%
mutate(yearid = .x))
ggplot(data = dat_long,
aes(x = yearid,
y = sum,
# note: use factor to get discrete color palette, fct_rev to stack 2021 on top
fill = fct_rev(factor(year)))) +
geom_col()
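The same idea can be applied starting from the raw person-level table rather than from hand-typed totals. The sketch below assumes a tiny made-up stand-in for producers (six hypothetical rows, with year stored as character as in the question); totals and heights are illustrative names.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the asker's `producers` table
producers <- data.frame(
  person_id = c("A", "B", "C", "D", "E", "F"),
  year = c("2019", "2020", "2020", "2020", "2021", "2021")
)

# Distinct people per year
totals <- producers %>%
  distinct(year, person_id) %>%
  count(year, name = "total")

# Repeat each year's row for every later year; yearid marks the target stack.
# Comparing 4-digit years as strings gives the same order as comparing numbers.
dat_long <- map_dfr(unique(totals$year), function(y) {
  totals %>%
    filter(year <= y) %>%
    mutate(yearid = y)
})

# Stack heights, which should equal the running total
heights <- tapply(dat_long$total, dat_long$yearid, sum)
heights
```

dat_long can then be passed to the ggplot call above, with total in place of sum as the y aesthetic.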

Related

Plotting dummy variables with ggplot2

I actually need help building on this question:
ggplot2 graphic order by grouped variable instead of in alphabetical order.
I need to produce a similar graph and I actually have a problem with the black points. I have data where column names are dates and rows are filled with 0 or 1 and I need to plot the point if the value is 1. To reproduce, here is a small sample (in my dataset, there are over 300 columns):
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
I need to plot the dates on the x axis, match the id to the canton and show the points where the value is 1.
Could anyone help?
Try this:
library(dplyr)
library(tidyr)
library(ggplot2)
plot_data = df %>%
## put data in long format
pivot_longer(-id, names_to = "colname") %>%
## keep only 1s
filter(value == 1) %>%
## convert dates to Date class
mutate(date = as.Date(colname, format = "%d%B%Y"))
plot_data
# # A tibble: 2 x 4
# id colname value date
# <dbl> <chr> <dbl> <date>
# 1 2 14August1970 1 1970-08-14
# 2 3 26April1970 1 1970-04-26
## plot
ggplot(plot_data, aes(x = date, y = factor(id))) +
geom_point()
Using this data:
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0), check.names = FALSE)
Maybe you are looking for this:
library(ggplot2)
library(dplyr)
library(tidyr)
#Data
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
#Code
df %>% pivot_longer(-id) %>%
ggplot(aes(x=name,y=factor(value)))+
geom_point(aes(color=factor(value)))+
scale_color_manual(values=c('transparent','black'))+
theme(legend.position = 'none')+xlab('Date')+ylab('value')
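One detail worth checking with this kind of data is what data.frame() does to column names that start with a digit. The base R sketch below compares the default behaviour with check.names = FALSE and shows one way to recover the date text afterwards. The object names mangled, kept and date_text are illustrative, and the date parsing assumes an English locale for month names.

```r
# Default (check.names = TRUE): names that are not syntactically valid are rewritten
mangled <- data.frame(id = c(1, 2, 3),
                      "26April1970" = c(0, 0, 1),
                      "14August1970" = c(0, 1, 0))
names(mangled)

# With check.names = FALSE the names are kept exactly as written
kept <- data.frame(id = c(1, 2, 3),
                   "26April1970" = c(0, 0, 1),
                   "14August1970" = c(0, 1, 0),
                   check.names = FALSE)
names(kept)

# If the data frame already has rewritten names, strip the added prefix
date_text <- sub("^X", "", names(mangled)[-1])

# Assumption: month names parse only in an English locale
dates <- as.Date(date_text, format = "%d%B%Y")
```

Plotting real Date values on the x axis also orders the points chronologically, whereas the raw column names would be ordered alphabetically.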

Further scaling plotted data by year into intervals

I am practicing visualizing data with R with a dataset on certain incidents worldwide. I created a data frame only containing the number of incidents per year with the plyr count function.
library(plyr)
df_incidents <- count(df$iyear)
names(df_incidents)[names(df_incidents) == "x"] <- "year"
names(df_incidents)[names(df_incidents) == "freq"] <- "incidents"
df_incidents
Output:
year incidents
1970 651
1971 471
1972 568
1973 473
1974 581
... all the way to 2018
I visualised the above data with ggplot(df_incidents,aes(x=year,y=incidents)) + geom_bar(stat="identity") which returned a histogram of incidents per year, but I am unable to further group year into intervals of 5 years.
Should I alter my ggplot statement to scale the data or further process my df_incidents data frame into distinctive groups of approx 5 years from 1970?
You can either keep one bar per year and set the axis breaks with scale_x_continuous(), or aggregate the data using a new variable defined with the cut() function. Here are both approaches:
library(ggplot2)
library(dplyr)
set.seed(123)
#Data
df_incidents <- data.frame(year=1978:2018,
incidents=round(runif(41,500,1000),0))
#Plot option 1
ggplot(df_incidents,aes(x=year,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')+
scale_x_continuous(breaks = seq(1978,2018,by=5))
And the second approach:
#Plot option 2
df_incidents %>%
mutate(Cutyear=cut(year,breaks = seq(1978,2018,by=5),include.lowest = T,right = F)) %>%
group_by(Cutyear) %>%
summarise(incidents=sum(incidents,na.rm=T)) %>%
ggplot(aes(x=Cutyear,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')
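You can bin the years into categories like this:
# five-year breaks from 1970 to 2020; include.lowest keeps 1970 in the first bin
df_incidents %>% mutate(Binned = cut(year, breaks = seq(1970, 2020, by = 5), include.lowest = TRUE)) %>%
# sum() adds up the yearly incidents within each bin
group_by(Binned) %>% summarize(Incidents = sum(incidents)) %>%
ggplot(aes(x = Binned, y = Incidents)) + geom_bar(stat = "identity")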
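An alternative to cut() that is not used in the answers above is integer division, which gives full control over the bin labels. This is only a sketch: the first five incident counts come from the question, the last five are made up for illustration, and bin_start, period and binned are hypothetical names.

```r
# First five values are from the question; the remaining five are made up
df_incidents <- data.frame(
  year = 1970:1979,
  incidents = c(651, 471, 568, 473, 581, 10, 20, 30, 40, 50)
)

# Start year of the five-year bin that each year falls into
bin_start <- 1970 + 5 * ((df_incidents$year - 1970) %/% 5)

# Readable label such as "1970-1974"
df_incidents$period <- paste0(bin_start, "-", bin_start + 4)

# Total incidents per period
binned <- aggregate(incidents ~ period, data = df_incidents, FUN = sum)
binned
```

binned can be plotted with ggplot(binned, aes(x = period, y = incidents)) + geom_col().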

How to plot "count" and "identity" in the same graph [closed]

I have a list of decimal numbers, ranging from 1 to 40K and I am trying to plot a frequency histogram together with the total sum of a given bin. I'm attempting to do it using ggplot2 but getting lost on how to use the same x axis bins from the histogram:
sales <- data.frame(amount = runif(100, min=0, max=40000))
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
ggplot(sales,aes(x=groups)) +
geom_bar(stat="count")+
geom_bar(aes(x=groups, y=amount), stat="identity") +
scale_y_continuous(sec.axis = sec_axis(~.*5, name = "sum"))
I managed to create both graphs independently, but they seem to overwrite each other.
If I understand correctly, you are trying to plot two different variables (count and sum) in the same bar graph. Because their ranges differ widely, you need a secondary y axis.
First, the grammar of ggplot2 expects one column for x values, one column for y values, and one or more columns for groups (this is a very brief and rough summary of how ggplot2 works).
Here, the idea is to use your breaks as the x variable, a second column holding all the y values to be plotted, and a group column indicating whether each y value belongs to the group "Count" or "amount". You can achieve this with the dplyr and tidyr packages:
set.seed(123)
sales <- data.frame(amount = runif(100, min=0, max=40000))
b = 4
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 3
# Groups: groups [4]
groups Variable Value
<fct> <chr> <dbl>
1 (1e+04,2e+04] Count 27
2 (1e+04,2e+04] amount 11503.
3 (3e+04,4e+04] Count 27
4 (3e+04,4e+04] amount 31532.
5 (1e+04,2e+04] Count 27
6 (1e+04,2e+04] amount 16359.
7 (3e+04,4e+04] Count 27
8 (3e+04,4e+04] amount 35321.
9 (3e+04,4e+04] Count 27
10 (3e+04,4e+04] amount 37619.
# … with 190 more rows
However, if you plot this directly, you get a poor plot in which the bars for "Count" are tiny compared to those for "amount":
library(ggplot2)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")%>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge())
So, you can try adding a secondary y axis with the sec.axis argument of scale_y_continuous. However, this won't change your plot; it simply creates a "fake" right axis whose scale is transformed by the formula you pass to sec.axis.
So, if you want both groups of values to be visible on the graph, you need to either scale down "amount" or scale up "Count" so that both groups have a similar range of values.
Here, as you want the sum on the right axis, we scale down the "amount" values so that they fall in the same range as the "Count" values.
On the graph, "amount" reaches around 40000, whereas the maximum of "Count" is 30. So you can choose the following scale factor: 40000 / 30 = 1333.333, rounded to 1300 below.
Now, if you create a second column called "Amount" that holds "amount" divided by 1300, "Amount" and "Count" are on the same range. Your data now looks like this:
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 4
# Groups: groups [4]
amount groups Variable Value
<dbl> <fct> <chr> <dbl>
1 24000. (2e+04,3e+04] Count 30
2 24000. (2e+04,3e+04] Amount 18.5
3 13313. (1e+04,2e+04] Count 30
4 13313. (1e+04,2e+04] Amount 10.2
5 19545. (1e+04,2e+04] Count 30
6 19545. (1e+04,2e+04] Amount 15.0
7 38179. (3e+04,4e+04] Count 20
8 38179. (3e+04,4e+04] Amount 29.4
9 19316. (1e+04,2e+04] Count 30
10 19316. (1e+04,2e+04] Amount 14.9
# … with 190 more rows
For the secondary y axis to reflect the real "amount" values, pass the inverse transformation to sec_axis, that is, multiply by 1300.
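The scale factor does not have to be read off the graph; it can be computed from the data. The base R sketch below assumes explicit breaks every 10000 (rather than the breaks returned by hist()) and uses the illustrative names counts, scale_factor, scaled and back.

```r
set.seed(123)
sales <- data.frame(amount = runif(100, min = 0, max = 40000))

# Assumption: explicit bins of width 10000
sales$groups <- cut(sales$amount, breaks = seq(0, 40000, by = 10000))

# Number of observations per bin
counts <- as.vector(table(sales$groups))

# Ratio between the largest amount and the largest count
scale_factor <- max(sales$amount) / max(counts)

# Scaled amounts now top out at the same height as the tallest count bar
scaled <- sales$amount / scale_factor

# Multiplying by the same factor recovers the original values,
# which is what the formula passed to sec_axis does for the axis labels
back <- scaled * scale_factor
```

Using the exact ratio instead of a rounded 1300 makes the tallest bar of each group the same height.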
Altogether, you get the following code:
library(ggplot2)
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge()) +
scale_y_continuous(name = "Count",sec.axis = sec_axis(~.*1300, name = "Sum"))
Thus, you get the illusion of plotting two different groups of values on two different scales.
Hope that this long explanation was helpful for you.
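If the goal is the total per bin rather than the individual amounts, one possible variation (not part of the answer above) is to summarise first, so there is exactly one count and one sum per bin, and then apply the same scaling trick. The names per_bin and scale_factor are illustrative, and the explicit breaks are an assumption.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(123)
sales <- data.frame(amount = runif(100, min = 0, max = 40000))
# Assumption: explicit bins of width 10000
sales$groups <- cut(sales$amount, breaks = seq(0, 40000, by = 10000))

# One row per bin: number of sales and their total
per_bin <- sales %>%
  group_by(groups) %>%
  summarise(Count = n(), Sum = sum(amount), .groups = "drop")

# Factor that brings the sums down to the range of the counts
scale_factor <- max(per_bin$Sum) / max(per_bin$Count)

p <- per_bin %>%
  mutate(Sum = Sum / scale_factor) %>%
  pivot_longer(c(Count, Sum), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = groups, y = Value, fill = Variable)) +
  geom_col(position = position_dodge()) +
  # The right axis undoes the scaling so its labels show real sums
  scale_y_continuous(name = "Count",
                     sec.axis = sec_axis(~ . * scale_factor, name = "Sum"))
p
```

Because each bin contributes a single bar per variable, no bars are drawn on top of each other.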

copy factor level order from one column to another

I have two columns in a data.frame, that should have levels sorted in the same order, but I don't know how to do it in a straightforward manner.
Here's the situation:
library(ggplot2)
library(dplyr)
library(magrittr)
set.seed(1)
df1 <- data.frame(rating = sample(c("GOOD","BAD","AVERAGE"),10,T),
div = sample(c("A","B","C"),10,T),
n = sample(100,10,T))
# I'm adding a label column that I use for plotting purposes
df1 <- df1 %>% group_by(rating) %>% mutate(label = paste0(rating," (",sum(n),")")) %>% ungroup
# # A tibble: 10 x 4
# rating div n label
# <fctr> <fctr> <int> <chr>
# 1 BAD C 48 BAD (220)
# 2 BAD B 87 BAD (220)
# 3 BAD C 44 BAD (220)
# 4 GOOD B 25 GOOD (77)
# 5 AVERAGE B 8 AVERAGE (117)
# 6 AVERAGE C 10 AVERAGE (117)
# 7 AVERAGE A 32 AVERAGE (117)
# 8 GOOD B 52 GOOD (77)
# 9 AVERAGE C 67 AVERAGE (117)
# 10 BAD C 41 BAD (220)
# rating levels are sorted
df1$rating <- factor(df1$rating,c("BAD","AVERAGE","GOOD"))
ggplot(df1,aes(x=rating,y=n,fill=div)) + geom_col() # plots in the order I want
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col() # doesn't because levels aren't sorted
How do I manage to copy the factor order from one column to another ?
I can make it work this way but I think it's really awkward:
lvls <- df1 %>% select(rating,label) %>% unique %>% arrange(rating) %>% extract2("label")
df1$label <- factor(df1$label,lvls)
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col()
Instead of adding a label column and using aes(x = label, ...), you can stick with aes(x = rating, ...) and create the labels in scale_x_discrete:
ggplot(df1, aes(x = rating, y = n, fill = div)) +
geom_col() +
scale_x_discrete(labels = df1 %>%
group_by(rating) %>%
summarize(n = sum(n)) %>%
mutate(lab = paste0(rating, " (", n, ")")) %>%
pull(lab))
Once you have set the levels of rating, you can use forcats to set the levels of label by the order of rating like this...
library(forcats)
df1 <- df1 %>% group_by(rating) %>%
mutate(label=paste0(rating," (",sum(n),")")) %>%
ungroup %>%
arrange(rating) %>% #sort by rating
mutate(label=fct_inorder(label)) #set levels by order in which they appear
Or you can use forcats::fct_reorder to do the same thing...
df1$label <- fct_reorder(df1$label, as.numeric(df1$rating))
The plot then has the bars in the right order.
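For comparison, the same reordering can be sketched in base R without forcats, by taking the labels in the order of the already-sorted rating factor. The small data frame below is made up for illustration and is not the one from the question.

```r
# Made-up example data
df1 <- data.frame(
  rating = c("GOOD", "BAD", "AVERAGE", "BAD", "GOOD"),
  n = c(25, 48, 8, 87, 52)
)

# Rating levels in the desired order
df1$rating <- factor(df1$rating, levels = c("BAD", "AVERAGE", "GOOD"))

# Label = rating plus the total n for that rating
df1$label <- paste0(df1$rating, " (", ave(df1$n, df1$rating, FUN = sum), ")")

# Order the label levels by the order of the rating levels
df1$label <- factor(df1$label, levels = unique(df1$label[order(df1$rating)]))

levels(df1$label)
```

order() on a factor sorts by level position, so the labels come out in the order BAD, AVERAGE, GOOD.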

Subgroup axes ggplot2 similar to Excel PivotChart

I am trying to plot Months with Year as subgroups on a chart with ggplot2. That is, something that looks like this:
A similar question was answered here, but I am hoping there is a better way that avoids hardcoding the axis labels.
The R code for the data frame is as follows:
set.seed(100)
df = data.frame( Year = rep(c(rep(2013, 12), rep(2014, 9)), 2)
, Month = rep(rep(month.abb, 2)[1:21], 2)
, Condition = rep(c("A", "B"), each=21)
, Value = runif(42))
As a bonus, I would appreciate learning how to plot smoothed totals by year without introducing a new variable (if this is possible?). If I use dplyr to summarise and group_by Year and Month, the order of the months is not preserved.
Notice, Month now starts at Apr:
group_by(df, Year, Month) %>% summarise(total = sum(Value)) %>% head
Source: local data frame [6 x 3]
Groups: Year
Year Month total
1 2013 Apr 0.4764846
2 2013 Aug 0.9194172
3 2013 Dec 1.2308575
4 2013 Feb 0.7960212
5 2013 Jan 1.0185700
6 2013 Jul 1.6943562
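Try this:
library(ggplot2)
library(magrittr) # provides %>%
library(gtable)
library(grid) # provides grid.newpage() and grid.draw()
# fix the month order so it is not alphabetical
df$Month <- factor(df$Month, levels=month.abb)
p <- ggplot(df, aes(Month, Value, colour=Condition, group=Condition))+
facet_grid(.~Year) + geom_line() + theme_minimal()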
g <- ggplotGrob(p)
g2 <- g[-3,] %>%
gtable_add_rows(heights = g$heights[3], nrow(g)-3) %>%
gtable_add_grob(g[3,], t = nrow(g)-2, l=1, r=ncol(g))
grid.newpage()
grid.draw(g2)
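Converting Month to a factor with levels = month.abb should also take care of the ordering problem in the summary from the question, because group_by() then sorts by level order rather than alphabetically. A minimal sketch using the question's data (the name totals is illustrative):

```r
library(dplyr)

set.seed(100)
df <- data.frame(Year = rep(c(rep(2013, 12), rep(2014, 9)), 2),
                 Month = rep(rep(month.abb, 2)[1:21], 2),
                 Condition = rep(c("A", "B"), each = 21),
                 Value = runif(42))

# Calendar order instead of alphabetical order
df$Month <- factor(df$Month, levels = month.abb)

# Totals per year and month; groups follow the factor levels
totals <- df %>%
  group_by(Year, Month) %>%
  summarise(total = sum(Value), .groups = "drop")

head(totals)
```

The first rows of totals should now start at Jan rather than Apr.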

Resources