Calculating a rate of change between min and max years per subgroup - r

I am relatively new to R, and I apologise if this has already been asked; I either couldn't understand the existing answers or couldn't find the right keywords.
Here is my problem: I have a dataset that looks like this:
Name Year Corg
1 Bois 17 2001 1.7
2 Bois 17 2007 2.1
3 Bois 17 2014 1.9
4 8-Toume 2000 1.7
5 8-Toume 2015 1.4
6 7-Richelien 2 2004 1.1
7 7-Richelien 2 2017 1.5
8 7-Richelien 2 2019 1.2
9 Communaux 2003 1.4
10 Communaux 2016 3.8
11 Communaux 2019 2.4
12 Cocandes 2000 1.7
13 Cocandes 2014 2.1
As you can see, I sometimes have two or three rows of results per Name (theoretically I could even have 4, 5 or more rows per Name).
For each name, I would like to calculate the annual Corg rate of change between the highest year and lowest year.
More specifically, I would like to compute:
(Corg_of_highest_year/Corg_of_lowest_year)^(1/(highest_year-lowest_year))-1
Could you explain how to obtain a summary dataset that looks like this:
Name Length_in_years Corg_rate
Bois 17 13 0.9%
8-Toume 15 -1.3%
etc.
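As a quick check of the arithmetic, here is the calculation done by hand for the Bois 17 rows (2001 and 2014). This is only a sketch: it assumes the exponent is meant to be 1/(highest_year - lowest_year), which is what reproduces the 0.9% in the desired output, and the variable names are made up for illustration.

```r
# Hand calculation for one group (Bois 17); names are illustrative only
corg_first <- 1.7          # Corg in the lowest year (2001)
corg_last  <- 1.9          # Corg in the highest year (2014)
n_years    <- 2014 - 2001  # length of the period in years

# Compound annual rate of change, as a fraction
rate <- (corg_last / corg_first)^(1 / n_years) - 1
round(100 * rate, 1)       # about 0.9 (percent per year)
```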

We can do the calculation using group_by in dplyr
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Length = diff(range(Year)),
Corg_rate = ((Corg[which.max(Year)]/Corg[which.min(Year)]) ^
(1/Length) - 1) * 100)
# A tibble: 5 x 3
# Name Length Corg_rate
# <fct> <int> <dbl>
#1 7-Richelien2 15 0.582
#2 8-Toume 15 -1.29
#3 Bois17 13 0.859
#4 Cocandes 14 1.52
#5 Communaux 16 3.43
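If you would rather avoid packages, the same per-group calculation can probably be done in base R with split and lapply. This is a minimal sketch, not the only way to do it: annual_rate is a hypothetical helper name, and the data frame is rebuilt here from the values in the question so the block runs on its own.

```r
# Same values as in the question, rebuilt with plain vectors
df <- data.frame(
  Name = c(rep("Bois17", 3), rep("8-Toume", 2), rep("7-Richelien2", 3),
           rep("Communaux", 3), rep("Cocandes", 2)),
  Year = c(2001, 2007, 2014, 2000, 2015, 2004, 2017, 2019,
           2003, 2016, 2019, 2000, 2014),
  Corg = c(1.7, 2.1, 1.9, 1.7, 1.4, 1.1, 1.5, 1.2, 1.4, 3.8, 2.4, 1.7, 2.1)
)

# Hypothetical helper: annual rate (in %) between the first and last year of one group
annual_rate <- function(d) {
  len   <- max(d$Year) - min(d$Year)
  ratio <- d$Corg[which.max(d$Year)] / d$Corg[which.min(d$Year)]
  data.frame(Name = d$Name[1], Length = len,
             Corg_rate = 100 * (ratio^(1 / len) - 1))
}

# Apply the helper to each Name and stack the one-row results
res <- do.call(rbind, lapply(split(df, df$Name), annual_rate))
rownames(res) <- NULL
res
```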
To run the same analysis between the most recent year and the latest earlier year that is at least 5 years older:
df %>%
group_by(Name) %>%
summarise(Length = max(Year) - max(Year[Year <= max(Year) - 5]),
Corg_rate = (Corg[which.max(Year)]/Corg[Year == max(Year[Year <= (max(Year) - 5)])]) ^ (1/Length) - 1,
Corg_rate = Corg_rate * 100)
# Name Length Corg_rate
# <fct> <int> <dbl>
#1 7-Richelien2 15 0.582
#2 8-Toume 15 -1.29
#3 Bois17 7 -1.42
#4 Cocandes 14 1.52
#5 Communaux 16 3.43
data
df <- structure(list(Name = structure(c(3L, 3L, 3L, 2L, 2L, 1L, 1L,
1L, 5L, 5L, 5L, 4L, 4L), .Label = c("7-Richelien2", "8-Toume",
"Bois17", "Cocandes", "Communaux"), class = "factor"), Year = c(2001L,
2007L, 2014L, 2000L, 2015L, 2004L, 2017L, 2019L, 2003L, 2016L,
2019L, 2000L, 2014L), Corg = c(1.7, 2.1, 1.9, 1.7, 1.4, 1.1,
1.5, 1.2, 1.4, 3.8, 2.4, 1.7, 2.1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))

By first flagging the rows with the max and min year within each Name, and then spreading both Corg and Year into wide columns (Corg_MAX, Corg_MIN, Year_MAX, Year_MIN) with pivot_wider (tidyr's successor to spread), we can easily calculate the rate of change afterwards.
library(dplyr)
library(tidyr)
my_df %>%
  group_by(Name) %>%
  mutate( # new column flagging the max and min year
    year_max_min = ifelse(Year == max(Year), "MAX",
                   ifelse(Year == min(Year), "MIN", NA))
  ) %>%
  filter(!is.na(year_max_min)) %>% # drop the in-between years
  ungroup() %>%
  pivot_wider(id_cols = Name, names_from = year_max_min,
              values_from = c(Corg, Year)) %>% # one row per Name
  mutate( # annual rate of change between the min and max year
    rate_of_change = (Corg_MAX / Corg_MIN)^(1 / (Year_MAX - Year_MIN)) - 1
  )

Use dplyr's group_by(Name) and then calculate your values per group. Here is an example:
library(dplyr)
data %>%
  group_by(Name) %>%
  summarise(Length = max(Year) - min(Year),
            Corg_End = sum(Corg[Year == max(Year)]),
            Corg_Start = sum(Corg[Year == min(Year)]))
This shows the logic of grouping: after group_by(Name), max(Year) returns the highest year per name rather than overall. With this logic, calculating the rate of change should be easy, but I won't attempt it for lack of reproducible data.
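For completeness, here is one way the last step might look once Length, Corg_End and Corg_Start exist. It is a sketch under assumptions: the input is just the first two groups from the question typed in by hand, and Corg_rate is a column name chosen here for illustration.

```r
library(dplyr)

# First two groups from the question, typed in by hand
data <- data.frame(
  Name = c("Bois17", "Bois17", "Bois17", "8-Toume", "8-Toume"),
  Year = c(2001, 2007, 2014, 2000, 2015),
  Corg = c(1.7, 2.1, 1.9, 1.7, 1.4)
)

res <- data %>%
  group_by(Name) %>%
  summarise(Length = max(Year) - min(Year),
            Corg_End = sum(Corg[Year == max(Year)]),
            Corg_Start = sum(Corg[Year == min(Year)])) %>%
  # annual rate of change in percent, built from the three summary columns
  mutate(Corg_rate = 100 * ((Corg_End / Corg_Start)^(1 / Length) - 1))
res
```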

Here is a solution using data.table:
library(data.table)
df = data.table(df)
mat = df[, .(
Rate = 100*((Corg[which.max(Year)] / Corg[which.min(Year)])^(1/diff(range(Year))) - 1)
), by = Name]
> mat
Name Rate
1: Bois17 0.8592524
2: 8-Toume -1.2860324
3: 7-Richelien2 0.5817615
4: Communaux 3.4261123
5: Cocandes 1.5207989

Related

Checking data in R (whether the full range of values in a data frame column exists)

I have stumbled across a problem in checking data with R. I am fairly new to it and unfortunately, I have not managed to find a solution.
An example of my data frame (let's call it X) is as follows:
ID Year Month
1 2012 7
1 2012 8
1 2012 9
2 2012 10
1 2012 11
3 2012 12
What I want to do is check for each ID whether all the months from 1 until 12 are present. I have tried this code :
Dataset_check <- X %>% mutate(check=X$ID<- ifelse(sapply(X$ID, function(Month)
any(X$Month <=12 & X$Month >=1)), "YES", NA))
but it does not check whether ALL of the months are included but rather if any of the months (1 through 12) are there.
I am not sure which function to use instead of "any" to check whether all of them exist. Do you have any ideas? Am I on the right track at all, or should I look at it another way?
Thank you in advance.
Does this work:
library(dplyr)
df %>% group_by(ID) %>% mutate(check = if_else(all(1:12 %in% Month), 'Yes','No'))
# A tibble: 6 x 4
# Groups: ID [3]
ID Year Month check
<dbl> <dbl> <int> <chr>
1 1 2012 7 No
2 1 2012 8 No
3 1 2012 9 No
4 2 2012 10 No
5 1 2012 11 No
6 3 2012 12 No
We may use base R as well
df1$check <- with(df1, c("No", "Yes")[1 + ave(Month, ID,
FUN = function(x) all(1:12 %in% x))])
df1$check
[1] "No" "No" "No" "No" "No" "No"
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 1L, 3L), Year = c(2012L,
2012L, 2012L, 2012L, 2012L, 2012L), Month = 7:12),
class = "data.frame", row.names = c(NA,
-6L))
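Since every ID in the sample data comes out as "No", it may help to see the check on data where one ID really is complete. The following is a small base R sketch on made-up data (the values of X here are invented for the illustration); it returns one TRUE/FALSE per ID rather than one per row.

```r
# Made-up data: ID 1 has every month, ID 2 only has three
X <- data.frame(
  ID    = c(rep(1, 12), rep(2, 3)),
  Year  = 2012,
  Month = c(1:12, 10:12)
)

# One TRUE/FALSE per ID: are all of 1..12 present among that ID's months?
complete <- tapply(X$Month, X$ID, function(m) all(1:12 %in% m))
complete

# IDs with at least one missing month
names(complete)[!complete]
```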

How to mutate a ratio for two populations by year

overseas_domestic_indicator ref_year count
<chr> <dbl> <dbl>
1 Domestic 2014 17854
2 Domestic 2015 18371
3 Domestic 2016 18975
4 Domestic 2017 19455
5 Domestic 2018 19819
6 Overseas 2014 6491
7 Overseas 2015 7393
8 Overseas 2016 8594
9 Overseas 2017 9539
10 Overseas 2018 10455
This is my data. I want something like:
ref_year Domestic/Overseas
2014 2.75
2015 ...
... ...
But I don't know how to do this using tidyverse. I tried to use mutate but I don't know how to clarify the count for Domestic and Overseas. Thanks in advance.
You can get the data in wide format first and then divide Domestic by Overseas
library(dplyr)
df %>%
tidyr::pivot_wider(names_from = overseas_domestic_indicator,
values_from = count) %>%
mutate(ratio = Domestic/Overseas)
# ref_year Domestic Overseas ratio
# <int> <int> <int> <dbl>
#1 2014 17854 6491 2.75
#2 2015 18371 7393 2.48
#3 2016 18975 8594 2.21
#4 2017 19455 9539 2.04
#5 2018 19819 10455 1.90
We can do a group by 'ref_year' and summarise by dividing the 'count' corresponding to 'Domestic' with that of 'Overseas' and reshape to 'wide' if needed
library(dplyr)
library(tidyr)
df1 %>%
group_by(ref_year) %>%
summarise(
`Domestic/Overseas` = count[overseas_domestic_indicator == 'Domestic']/
count[overseas_domestic_indicator == 'Overseas'])
# A tibble: 5 x 2
# ref_year `Domestic/Overseas`
# <int> <dbl>
#1 2014 2.75
#2 2015 2.48
#3 2016 2.21
#4 2017 2.04
#5 2018 1.90
Or arrange first and then do a division
df1 %>%
arrange(ref_year, overseas_domestic_indicator) %>%
group_by(ref_year) %>%
summarise( `Domestic/Overseas` = first(count)/last(count))
Or with dcast from data.table
library(data.table)
dcast(setDT(df1), ref_year ~ overseas_domestic_indicator, value.var = "count")[,
`Domestic/Overseas` := Domestic/Overseas][]
data
df1 <- structure(list(overseas_domestic_indicator = c("Domestic", "Domestic",
"Domestic", "Domestic", "Domestic", "Overseas", "Overseas", "Overseas",
"Overseas", "Overseas"), ref_year = c(2014L, 2015L, 2016L, 2017L,
2018L, 2014L, 2015L, 2016L, 2017L, 2018L), count = c(17854L,
18371L, 18975L, 19455L, 19819L, 6491L, 7393L, 8594L, 9539L, 10455L
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
This should work too; there are multiple ways to do this:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = overseas_domestic_indicator,
              values_from = count) %>%
  mutate(Ratio = Domestic/Overseas)
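If you want to stay in base R, a rough equivalent is to split the table by indicator and merge the two halves on ref_year. This is only a sketch; the suffixes and the names dom, ovs and wide are arbitrary choices made for this example.

```r
df1 <- data.frame(
  overseas_domestic_indicator = rep(c("Domestic", "Overseas"), each = 5),
  ref_year = rep(2014:2018, times = 2),
  count = c(17854, 18371, 18975, 19455, 19819, 6491, 7393, 8594, 9539, 10455)
)

# Split the table by indicator and match the two halves on ref_year
dom  <- df1[df1$overseas_domestic_indicator == "Domestic", ]
ovs  <- df1[df1$overseas_domestic_indicator == "Overseas", ]
wide <- merge(dom, ovs, by = "ref_year", suffixes = c("_dom", "_ovs"))

# One row per year with the Domestic/Overseas ratio
res <- data.frame(ref_year = wide$ref_year,
                  ratio = wide$count_dom / wide$count_ovs)
res
```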

How to remove all observations for which there is no observation in the current year in R?

num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%
Let's say I have data like the above. I want to remove the observations that do not have any observations in the most recent year. So, in the above, we would be left with A & B, but C & D would be deleted. The most recent season will always be in the data and can be referenced with the max() function (i.e., we don't need to hardcode it as 2019 and update it yearly).
The plan is to create a facet wrapped line chart where the percentages are on the y-axis and the years are on the x-axis. The facet would be on the names so each individual will have its own line chart with their percentages by year. We don't care about people who left, so that's why we're dropping records. Though, there is a chance they come back, so I don't want to drop them from the underlying data.
One dplyr option could be:
df %>%
group_by(Name) %>%
filter(any(year %in% max(df$year)))
num Name year X Y
<int> <chr> <int> <int> <chr>
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
We can use subset from base R as well: take the 'Name' values where 'year' is the max, get the unique elements, and create a logical vector with %in% to subset the rows
subset(df1, Name %in% unique(Name[year == max(year)]))
# num Name year X Y
#1 1 A 2015 68 80%
#2 1 A 2016 69 85%
#3 1 A 2017 70 95%
#4 1 A 2018 71 85%
#5 1 A 2019 72 90%
#6 2 B 2018 20 80%
#7 2 B 2019 23 75%
No packages are used
Or the similar syntax in dplyr
library(dplyr)
df1 %>%
filter(Name %in% unique(Name[year == max(year)]))
data
df1 <- structure(list(num = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 4L, 4L
), Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"),
year = c(2015L, 2016L, 2017L, 2018L, 2019L, 2018L, 2019L,
2014L, 2012L, 2013L), X = c(68L, 69L, 70L, 71L, 72L, 20L,
23L, 3L, 4L, 5L), Y = c("80%", "85%", "95%", "85%", "90%",
"80%", "75%", "55%", "75%", "100%")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Using the data frame DF shown in the Note at the end we use semi_join to reduce it to the required names, convert Y to numeric and plot it. DF is not modified.
A possible alternative to the semi_join line is
filter(ave(year == max(year), Name, FUN = any)) %>%
The code is:
library(dplyr)
library(ggplot2)
DF %>%
semi_join(filter(., year == max(year)), by = "Name") %>%
mutate(Y = as.numeric(sub("%", "", Y))) %>%
ggplot(aes(year, Y)) + geom_line() + facet_wrap(~Name)
Note
The input in reproducible form:
Lines <- " num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%"
DF <- read.table(text = Lines)
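To double-check which names get dropped, and that the underlying data really is left alone, something like the following base R sketch could be used. The object names current and kept are made up here, and only the columns needed for the check are rebuilt.

```r
DF <- data.frame(
  Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"),
  year = c(2015:2019, 2018, 2019, 2014, 2012, 2013),
  Y = c("80%", "85%", "95%", "85%", "90%", "80%", "75%", "55%", "75%", "100%"),
  stringsAsFactors = FALSE
)

current <- unique(DF$Name[DF$year == max(DF$year)])  # names seen in the latest year
kept <- DF[DF$Name %in% current, ]                   # filtered copy for plotting
kept$Y <- as.numeric(sub("%", "", kept$Y))           # "80%" -> 80

setdiff(DF$Name, current)  # names left out of the plot
nrow(DF)                   # the original still has all its rows
```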

R: How to to replace(switch) the max and min values in a row in a dataframe when max <= min?

How can we swap the max and min values in each row of this data frame, but ONLY where max <= min?
> my_data
year month day max min
1 2019 1 1 20.4 -24.4
2 2019 1 2 12.9 -20.4
3 2019 1 3 -27.1 10.3
4 2019 1 4 -20.8 11.0
5 2019 1 5 -16.2 -8.9
The result should be like this:
> my_data
year month day max min
1 2019 1 1 20.4 -24.4
2 2019 1 2 12.9 -20.4
3 2019 1 3 10.3 -27.1
4 2019 1 4 11.0 -20.8
5 2019 1 5 -8.9 -16.2
Thanks in advance.
One option is pmax/pmin
library(dplyr)
my_data %>%
mutate(maxnew = pmax(max, min), minnew = pmin(max, min)) %>%
select(year, month, day, max = maxnew, min = minnew)
# year month day max min
#1 2019 1 1 20.4 -24.4
#2 2019 1 2 12.9 -20.4
#3 2019 1 3 10.3 -27.1
#4 2019 1 4 11.0 -20.8
#5 2019 1 5 -8.9 -16.2
Or a compact way is with base R
nm1 <- c('max', 'min')
my_data[nm1] <- t(apply(my_data[nm1], 1, sort))[, 2:1]
Or using pmax/pmin
my_data[nm1] <- lapply(list(pmax, pmin), function(f) do.call(f, my_data[nm1]))
data
my_data <- structure(list(year = c(2019L, 2019L, 2019L, 2019L, 2019L), month = c(1L,
1L, 1L, 1L, 1L), day = 1:5, max = c(20.4, 12.9, -27.1, -20.8,
-16.2), min = c(-24.4, -20.4, 10.3, 11, -8.9)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
We can find the rows where max is less than min, store those max values in a temporary variable, and then use it to swap the max and min values.
inds <- df$max < df$min
temp <- df$max[inds]
df$max[inds] <- df$min[inds]
df$min[inds] <- temp
df
# year month day max min
#1 2019 1 1 20.4 -24.4
#2 2019 1 2 12.9 -20.4
#3 2019 1 3 10.3 -27.1
#4 2019 1 4 11.0 -20.8
#5 2019 1 5 -8.9 -16.2
data
df <- structure(list(year = c(2019L, 2019L, 2019L, 2019L, 2019L), month = c(1L,
1L, 1L, 1L, 1L), day = 1:5, max = c(20.4, 12.9, -27.1, -20.8,
-16.2), min = c(-24.4, -20.4, 10.3, 11, -8.9)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
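Another base R option that avoids the temporary variable is transform with pmax/pmin. As a sketch: both expressions are evaluated against the original columns before anything is assigned, so the new max does not affect how the new min is computed. The name swapped is just for this example.

```r
my_data <- data.frame(
  year = 2019, month = 1, day = 1:5,
  max = c(20.4, 12.9, -27.1, -20.8, -16.2),
  min = c(-24.4, -20.4, 10.3, 11, -8.9)
)

# Row-wise larger value goes to max, row-wise smaller value goes to min
swapped <- transform(my_data, max = pmax(max, min), min = pmin(max, min))
swapped

# Every row should now satisfy max >= min
all(swapped$max >= swapped$min)
```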

Using dplyr to summarize by multiple groups

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible please do provide alternate means (i.e. non-dplyr) methods, because I'm still new with R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code; I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric corrupted Num values, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
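Since the question also asked for a non-dplyr route, here is a base R sketch with aggregate. It also shows why as.numeric on a factor gives the wrong numbers: it returns the level codes, so the factor has to go through as.character first. The name num_f is made up for the example, and the data are retyped from the question.

```r
# Num stored as a factor, as in the question's dput output
num_f <- factor(c("99", "85", "60", "90", "40", "30", "10"))
as.numeric(num_f)                # level codes (7 5 4 6 3 2 1), not the values
as.numeric(as.character(num_f))  # the actual numbers

df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4"),
  Num  = as.numeric(as.character(num_f))
)

# Base R: mean of Num for every Year/Area combination that occurs in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
res
```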
