r - Calculate % within a Sub Group using Dplyr - r

I want to chart the relative no of fatalities by year for each of various event types.
I can do with with facets in ggplot but am struggling to calculate the % By Event based on Event, Year and no of fatalities.
Event Type Year Fatalities % by Event
(calculated)
----- ---- ---------- ----------
Storm 1980 5 12.5%
Storm 1981 9 22.5%
Storm 1982 15 37.5%
Storm 1983 11 27.5%
Ice 1980 7 70%
Ice 1981 3 30%
I have the following code to calculate it, but the calculation is not working with the % using a much higher denominator.
fatalitiesByYearType <- stormDF %>%
group_by(eventType) %>%
mutate(totalEventFatalities = sum(FATALITIES)) %>%
group_by(year, add = TRUE) %>%
mutate(fatalitiesPct = sum(FATALITIES) / totalEventFatalities)
What am I doing wrong?
My charting as a below. I include this in case as I'm also interested to see whether there is a way of showing data in a proportionate way within ggplot.
p <- ggplot(data = fatalitiesByYearType,
aes(x=factor(year),y=fatalitiesPct))
p + geom_bar(stat="identity") +
facet_wrap(.~eventType, nrow = 5) +
labs(x = "Year",
y = "Fatalities",
title = "Fatalities by Type")

Maybe I do not get your problem, but we can start from here:
library(dplyr)
library(ggplot2)
# here the dplyr part
dats <- fatalitiesByYearType %>%
group_by(eventType) %>%
mutate(totalEventFatalities = sum(FATALITIES)) %>%
group_by(year, add = TRUE) %>%
# here we add the summarise
summarise(fatalitiesPct = sum(FATALITIES) / totalEventFatalities)
dats
# A tibble: 6 x 3
# Groups: eventType [?]
eventType year fatalitiesPct
<fct> <int> <dbl>
1 Ice 1980 0.7
2 Ice 1981 0.3
3 Storm 1980 0.125
4 Storm 1981 0.225
5 Storm 1982 0.375
6 Storm 1983 0.275
You can clearly merge everything in an unique dplyr chain:
# here the ggplot2 part
p <- ggplot(dats,aes(x=factor(year),y=fatalitiesPct)) +
geom_bar(stat="identity") +
facet_wrap(.~eventType, nrow = 5) +
labs(x = "Year", y = "Fatalities", title = "Fatalities by Type") +
# here we add the % in the plot
scale_y_continuous(labels = scales::percent)
With data:
fatalitiesByYearType <- read.table(text = "eventType year FATALITIES
Storm 1980 5
Storm 1981 9
Storm 1982 15
Storm 1983 11
Ice 1980 7
Ice 1981 3 ",header = T)

Related

Get the proportions in ggplot2 (R) bar charts

Can someone provide me some hints as to what I am doing wrong in my code? Or what I need to correct to get the correct percentages? I am trying to get the proportions by manipulating my ggplot2 code. I would prefer not mutating a column. However, if I can't get ggplot2 to give me the correct proportions, I will then be open to adding columns.
Here is the reproduceable data:
cat_type<-c("1", "1","2","3","1","3", "3","2","1","1","1","3","3","2","3","2","3","1","3","3","3","1","3","1","3","1","1","3","1")
country<-c("India","India","India","India","India","India","India","India","India","India","Indonesia","Russia","Indonesia","Russia","Russia","Indonesia","Indonesia","Indonesia","Indonesia","Russia","Indonesia","Russia","Indonesia","Indonesia","Russia", "Russia", "India","India","India")
bigcats<-data.frame(cat_type=cat_type,country=country)
My data gives me the following proportions (these are correct):
> table(bigcats$cat_type, bigcats$country) ## raw numbers
India Indonesia Russia
1 7 3 2
2 2 1 1
3 4 5 4
>
> 100*round(prop.table(table(bigcats$cat_type, bigcats$country),2),3) ## proportions by column total
India Indonesia Russia
1 53.8 33.3 28.6
2 15.4 11.1 14.3
3 30.8 55.6 57.1
However, my ggplot2 is giving me the incorrect proportions:
bigcats %>% ggplot(aes(x=country, y = prop.table(stat(count)), fill=cat_type, label = scales::percent(prop.table(stat(count)))))+
geom_bar(position = position_fill())+
geom_text(stat = "count", position = position_fill(vjust=0.5),colour = "white", size = 5)+
labs(y="Percent",title="Top Big Cat Populations",x="Country")+
scale_fill_discrete(name=NULL,labels=c("Siberian/Bengal", "Other wild cats", "Puma/Leopard/Jaguar"))+
scale_y_continuous(labels = scales::percent)
The issue is that using prop.table(stat(count)) will not compute the proportions by categories or your countries, i.e. you do:
library(dplyr)
bigcats %>%
count(cat_type, country) %>%
mutate(pct = scales::percent(prop.table(n)))
#> cat_type country n pct
#> 1 1 India 7 24.1%
#> 2 1 Indonesia 3 10.3%
#> 3 1 Russia 2 6.9%
#> 4 2 India 2 6.9%
#> 5 2 Indonesia 1 3.4%
#> 6 2 Russia 1 3.4%
#> 7 3 India 4 13.8%
#> 8 3 Indonesia 5 17.2%
#> 9 3 Russia 4 13.8%
Making use of a helper function to reduce code duplication you could compute your desired proportions like so:
library(ggplot2)
prop <- function(count, group) {
count / tapply(count, group, sum)[group]
}
ggplot(bigcats, aes(
x = country, y = prop(after_stat(count), after_stat(x)),
fill = cat_type, label = scales::percent(prop(after_stat(count), after_stat(x)))
)) +
geom_bar(position = position_fill()) +
geom_text(stat = "count", position = position_fill(vjust = 0.5), colour = "white", size = 5) +
labs(y = "Percent", title = "Top Big Cat Populations", x = "Country") +
scale_fill_discrete(name = NULL, labels = c("Siberian/Bengal", "Other wild cats", "Puma/Leopard/Jaguar")) +
scale_y_continuous(labels = scales::percent)
Created on 2021-07-28 by the reprex package (v2.0.0)

Visualize rank-change using alluvial in R ggalluvial

I have a pretty basic df in which I have calculated the rank-change of values between two timestamps:
value rank_A rank_B group
1 A 1 1 A
2 B 2 3 A
3 C 3 2 B
4 D 4 4 B
5 E 5 8 A
6 F 6 5 C
7 G 7 6 C
8 H 8 7 A
What makes it a bit tricky (for me) is plotting the values on the Y-axis.
ggplot(df_alluvial, aes(y = value, axis1 = rank_A, axis2 = rank_B))+
geom_alluvium(aes(fill = group), width = 1/12)+
...
As of now, I can plot the rank-change and the groups successfully, but they are not linked to my value-names - there are no axis names and I don't know how to add them.
In the end it should look similiar to this:
https://www.reddit.com/r/GraphicalExcellence/comments/4imh5f/alluvial_diagram_population_size_and_rank_of_uk/
Thanks for your advice!
Your update made the question more clear to me.
The y parameter should be a numerical value, and the data should be in 'long' format. I'm not sure how to change your data to fulfill these requirements. Therefore, I create some new data in this example. I have tried to make the data similar to the data in the plot that you have linked to.
Labels and stratum refer to the city-names. You can use geom_text to label the strata.
# Load libraries
library(tidyverse)
library(ggalluvial)
# Create some data
df_alluvial <- tibble(
city = rep(c("London", "Birmingham", "Manchester"), 4),
year = rep(c(1901, 1911, 1921, 1931), each = 3),
size = c(0, 10, 100, 10, 15, 100, 15, 20, 100, 30, 25, 100))
# Notice the data is in long-format
df_alluvial
#> # A tibble: 12 x 3
#> city year size
#> <chr> <dbl> <dbl>
#> 1 London 1901 0
#> 2 Birmingham 1901 10
#> 3 Manchester 1901 100
#> 4 London 1911 10
#> 5 Birmingham 1911 15
#> 6 Manchester 1911 100
#> 7 London 1921 15
#> 8 Birmingham 1921 20
#> 9 Manchester 1921 100
#> 10 London 1931 30
#> 11 Birmingham 1931 25
#> 12 Manchester 1931 100
ggplot(df_alluvial,
aes(x = as.factor(year), stratum = city, alluvium = city,
y = size,
fill = city, label = city))+
geom_stratum(alpha = .5)+
geom_alluvium()+
geom_text(stat = "stratum", size = 3)
If you want to sort the cities based on their size, you can add decreasing = TRUE to all layers in the plot.
ggplot(df_alluvial,
aes(x = as.factor(year), stratum = city, alluvium = city,
y = size,
fill = city, label = city))+
geom_stratum(alpha = .5, decreasing = TRUE)+
geom_alluvium(decreasing = TRUE)+
geom_text(stat = "stratum", size = 3, decreasing = TRUE)
Created on 2019-11-08 by the reprex package (v0.3.0)

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))

Is there an R function to subtotals certain rows/columns to easily plot by line or bar chart?

I'm trying to sum the totals of each course status by School letter. Is there a function that I can write to group by School and sum totals for each column?
I wrote the function Group by School to get totals for every school but am having a hard time putting it into a visual using ggplot etc.
Dataframe is MathClass with School as a factor and courses as numerics.
#Group by School
GroupbySchool <- MathClass %>%
group_by(School) %>%
summarise_all(funs(sum))
School Middling Behind `More Behind` `Very Behind` Completed
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 113 450 73 154 142
2 B 84 201 14 22 125
3 C 11 39 4 12 19
4 D 3 8 2 6 3
5 E 11 56 7 15 27
Here is one way to visualize it. First I had to transform the data like so:
library(tidyr)
library(magrittr)
library(ggplot2)
df2 <- gather(df[, -1], Group, Response) %>%
cbind(data.frame(School=rep(LETTERS[1:5], 5)), .)
Then I used ggplot and geom_bar to make this:
ggplot(df2, aes(x=School, y=Response, colour = Group, fill = Group)) +
geom_bar(stat = "identity", position = "dodge") +
theme_bw()
Final output:
Does this do anything like what you're after?
Note:
I had to change the names for More Behind and Very Behind to MB and VB as they wouldn't read in properly

multiple line graphs with trend line

I have a csv file that has four columns(year, TMAX, TMEAN, and TMIN) ranging from the year 1900 to 2014. In a single window, i want to make 3 line graphs of TMAX, TMEAN, and TMIN with X axis Year (1900:2014). I also want want to show the trend lines in the graphs and thier associated r squared values in legends. So far i have written following code:
library(ggplot2)
library(reshape)
data=read.table("temp_red.csv",header=TRUE, sep=",")
frame=data.frame(data[1:4])
meltd=melt(frame,id.vars="Year")
matplot(frame[2:4], type = c("l"),col = 1:3)
ggplot(meltd, aes(x = time, y = value, colour = variable)) + geom_line()
Year TMAX TMEAN TMIN
1900 11.19989107 4.684640523 -1.837690632
1901 10.26497821 4.098583878 -2.074891068
1902 10.03077342 4.025054466 -1.99291939
1903 9.378540305 2.862472767 -3.651416122
1904 8.66040305 2.659313725 -3.351579521
1905 9.703703704 3.590686275 -2.534313725
1906 9.874455338 3.795479303 -2.290305011
2014 8.599673203 2.360566449 -3.88671024
I dont know how to display Trend line with R squared value in the graph using r. Please help.
I believe the following would work for you. Before I start please notice related discussions here and here. First I will generate some input:
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(1)
year <- 1990:2010
Tmax <- rnorm(21, 9)
Tmean <- rnorm(21, 3.5)
Tmin <- rnorm(21, -2)
df <- data.frame(year, Tmax, Tmean, Tmin)
df <- tbl_df(df)
df
Source: local data frame [21 x 4]
year Tmax Tmean Tmin
(int) (dbl) (dbl) (dbl)
1 1990 8.373546 4.282136 -1.303037
2 1991 9.183643 3.574565 -1.443337
3 1992 8.164371 1.510648 -2.688756
4 1993 10.595281 4.119826 -2.707495
5 1994 9.329508 3.443871 -1.635418
6 1995 8.179532 3.344204 -1.231467
7 1996 9.487429 2.029248 -2.112346
8 1997 9.738325 3.021850 -1.118892
9 1998 9.575781 3.917942 -1.601894
10 1999 8.694612 4.858680 -2.612026
.. ... ... ...
Next I will use tidyr to prepare the data for plotting:
df1 <- df %>% gather(key, Value, -year)
df1
Source: local data frame [63 x 3]
year key Value
(int) (fctr) (dbl)
1 1990 Tmax 8.373546
2 1991 Tmax 9.183643
3 1992 Tmax 8.164371
4 1993 Tmax 10.595281
5 1994 Tmax 9.329508
6 1995 Tmax 8.179532
7 1996 Tmax 9.487429
8 1997 Tmax 9.738325
9 1998 Tmax 9.575781
10 1999 Tmax 8.694612
.. ... ... ...
And just before plotting I will extract the values of R^2 needed for the plot:
r2 <- df1 %>% group_by(key) %>%
do(mod = lm(Value ~ year, data = .)) %>%
mutate(r2sq = summary(mod)$r.squared) %>%
select(key, r2sq)
r2
Source: local data frame [3 x 2]
Groups: <by row>
key r2sq
(fctr) (dbl)
1 Tmax 0.03718175
2 Tmean 0.01216523
3 Tmin 0.02820540
Now to the plot:
pl <- ggplot(df1, aes(x = year, y = Value, col = key)) + geom_line() +
geom_smooth(method = lm)
pl + geom_text(data = r2, aes(x= 2005, y = c(11, 5, 1),
label = paste0("R^2 : ", round(r2sq, 3))), parse = T,
col = "black", show.legend = F)
The result is the following:
Hope this helps.
You could use stat_smooth. Using your meltd dataframe
ggplot(meltd, aes(x = Year, y = value, colour = variable)) +
geom_line() +
stat_smooth(method = lm)
EDIT:
Using geom_smooth(method = lm) will also work.

Resources