How to mutate a ratio for two populations by year - r

overseas_domestic_indicator ref_year count
<chr> <dbl> <dbl>
1 Domestic 2014 17854
2 Domestic 2015 18371
3 Domestic 2016 18975
4 Domestic 2017 19455
5 Domestic 2018 19819
6 Overseas 2014 6491
7 Overseas 2015 7393
8 Overseas 2016 8594
9 Overseas 2017 9539
10 Overseas 2018 10455
This is my data. I want something like:
ref_year Domestic/Overseas
2014 2.75
2015 ...
... ...
But I don't know how to do this using tidyverse. I tried to use mutate, but I don't know how to refer to the counts for Domestic and Overseas separately. Thanks in advance.

You can get the data in wide format first and then divide Domestic by Overseas
library(dplyr)
df %>%
tidyr::pivot_wider(names_from = overseas_domestic_indicator,
values_from = count) %>%
mutate(ratio = Domestic/Overseas)
# ref_year Domestic Overseas ratio
# <int> <int> <int> <dbl>
#1 2014 17854 6491 2.75
#2 2015 18371 7393 2.48
#3 2016 18975 8594 2.21
#4 2017 19455 9539 2.04
#5 2018 19819 10455 1.90

We can do a group by 'ref_year' and summarise by dividing the 'count' corresponding to 'Domestic' with that of 'Overseas' and reshape to 'wide' if needed
library(dplyr)
library(tidyr)
df1 %>%
group_by(ref_year) %>%
summarise(
`Domestic/Overseas` = count[overseas_domestic_indicator == 'Domestic']/
count[overseas_domestic_indicator == 'Overseas'])
# A tibble: 5 x 2
# ref_year `Domestic/Overseas`
# <int> <dbl>
#1 2014 2.75
#2 2015 2.48
#3 2016 2.21
#4 2017 2.04
#5 2018 1.90
Or arrange first and then do a division
df1 %>%
arrange(ref_year, overseas_domestic_indicator) %>%
group_by(ref_year) %>%
summarise( `Domestic/Overseas` = first(count)/last(count))
Or with dcast from data.table
library(data.table)
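# value.var is stated explicitly so dcast does not have to guess the value column
dcast(setDT(df1), ref_year ~ overseas_domestic_indicator, value.var = "count")[,
`Domestic/Overseas` := Domestic/Overseas][]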
data
df1 <- structure(list(overseas_domestic_indicator = c("Domestic", "Domestic",
"Domestic", "Domestic", "Domestic", "Overseas", "Overseas", "Overseas",
"Overseas", "Overseas"), ref_year = c(2014L, 2015L, 2016L, 2017L,
2018L, 2014L, 2015L, 2016L, 2017L, 2018L), count = c(17854L,
18371L, 18975L, 19455L, 19819L, 6491L, 7393L, 8594L, 9539L, 10455L
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))

This should work too; there are multiple ways to do this.
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = overseas_domestic_indicator,
values_from = count) %>%
mutate(Ratio = Domestic/Overseas)
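For comparison, here is a minimal base R sketch of the same reshape-then-divide idea, with no packages. The data frame is rebuilt from the question's table; the names `wide` and `ratio` are only illustrative.

```r
# Rebuild the question's data (counts copied from the table in the question)
df1 <- data.frame(
  overseas_domestic_indicator = rep(c("Domestic", "Overseas"), each = 5),
  ref_year = rep(2014:2018, times = 2),
  count = c(17854, 18371, 18975, 19455, 19819,
            6491, 7393, 8594, 9539, 10455)
)

# Cross-tabulate: one row per year, one column per population
wide <- tapply(df1$count,
               list(df1$ref_year, df1$overseas_domestic_indicator),
               sum)

# Divide the two columns and return a plain data frame
ratio <- data.frame(
  ref_year = as.integer(rownames(wide)),
  ratio = unname(round(wide[, "Domestic"] / wide[, "Overseas"], 2))
)
ratio
```

The rounded ratios should agree with the `ratio` column printed in the answers above (2.75 for 2014 down to 1.90 for 2018).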

Related

Aggregate dataframe by condition in R

I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01, 02, 21, and so on. I want to aggregate these rows from the 6-digit code to its first 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing the numerical columns such as Quantity and Price. Rows with NAs should also be taken into account in the calculation. In the end I want output like this:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 France
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)
df %>%
mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
group_by(Y, Year, Country) %>%
summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
Update, as requested (group by the first two digits of Y and ignore NAs when summing):
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))
We could use substr to get the first two characters from Y and group_by and summarise() with sum()
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(Price = sum(Price, na.rm = TRUE),
Quantity = sum(Quantity, na.rm = TRUE)
)
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
transform(dat, Y=substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
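Note that in the aggregate output above Y appears as 10 and 20 rather than 01 and 02, because the sample data stores Y as an integer and the leading zero is lost. A minimal sketch of one way around this, assuming the codes are always six digits wide: pad with `sprintf` before taking the substring. The small data frame and the column name `Y2` are made up for illustration.

```r
# Toy data: Y stored as integer, so 010190 has become 10190
dat <- data.frame(
  Y = c(10190L, 10190L, 20190L, 20190L, 217834L),
  Quantity = c(4L, 3L, NA, 6L, 3L)
)

# Pad back to six digits, then keep the first two characters
dat$Y2 <- substr(sprintf("%06d", dat$Y), 1, 2)

# The formula interface of aggregate() drops rows containing NA
# by default (na.action = na.omit), so the NA row is ignored here
res <- aggregate(Quantity ~ Y2, dat, sum)
res
```

Here the group codes come out as "01", "02" and "21", and the NA quantity is skipped rather than turning the whole group sum into NA.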

Creating a new column using scores from past years (which is in the same dataframe)

I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country year score
France 2020 10
France 2019 9
Germany 2020 15
Germany 2019 14
I would like to have a new column called previous_year_score that would look into the data frame looking for the "score" of a country for the "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have a NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]],df[[2]]-1),paste(df[[1]],df[[2]]))
df$prev_score <- df[i,3]
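To make the match() idea easier to follow, here is the same lookup written with column names instead of positions. This is only a sketch; the intermediate names `key` and `wanted` are illustrative.

```r
df <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  year = c(2020L, 2019L, 2020L, 2019L),
  score = c(10L, 9L, 15L, 14L)
)

# One key per row, e.g. "France 2020"
key <- paste(df$country, df$year)
# The key each row is looking for: same country, previous year
wanted <- paste(df$country, df$year - 1L)

# match() returns the row position of each wanted key, or NA if absent
i <- match(wanted, key)
df$previous_year_score <- df$score[i]
df
```

Rows whose previous year is not in the data get NA, which is the behaviour asked for in the question.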
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
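Another way to express "score of year - 1" is a self-join: shift every year forward by one and merge the result back. The base R sketch below adds a hypothetical country (Spain) with a gap year to show why the `year - lag(year) == 1` check in the answers above matters; the names `scores`, `prev` and `out` are illustrative.

```r
scores <- data.frame(
  country = c("France", "France", "Germany", "Germany", "Spain", "Spain"),
  year = c(2020L, 2019L, 2020L, 2019L, 2020L, 2018L),
  score = c(10L, 9L, 15L, 14L, 7L, 5L)
)

# Lookup table: each score is relabelled with the year it is "previous" to
prev <- data.frame(
  country = scores$country,
  year = scores$year + 1L,
  previous_year_score = scores$score
)

# Left join; rows without a match in prev get NA
out <- merge(scores, prev, by = c("country", "year"), all.x = TRUE)
out
```

Spain 2020 gets NA because 2019 is missing, whereas a plain `lag(score)` without the year check would wrongly pick up the 2018 score.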

How to remove all observations for which there is no observation in the current year in R?

num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%
Let's say I have data like the above. I want to remove the observations for names that have no observation in the most recent year. So, in the above, we would be left with A & B, but C & D would be deleted. The most recent season will always be in the data and can be referenced with the max() function (i.e., we don't need to hardcode it as 2019 and update it yearly).
The plan is to create a facet wrapped line chart where the percentages are on the y-axis and the years are on the x-axis. The facet would be on the names so each individual will have its own line chart with their percentages by year. We don't care about people who left, so that's why we're dropping records. Though, there is a chance they come back, so I don't want to drop them from the underlying data.
One dplyr option could be:
df %>%
group_by(Name) %>%
filter(any(year %in% max(df$year)))
num Name year X Y
<int> <chr> <int> <int> <chr>
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
We can use subset from base R as well: take the 'Name' values where 'year' is the max, get the unique elements, and create a logical vector with %in% to subset the rows
subset(df1, Name %in% unique(Name[year == max(year)]))
# num Name year X Y
#1 1 A 2015 68 80%
#2 1 A 2016 69 85%
#3 1 A 2017 70 95%
#4 1 A 2018 71 85%
#5 1 A 2019 72 90%
#6 2 B 2018 20 80%
#7 2 B 2019 23 75%
No packages are used
Or the similar syntax in dplyr
library(dplyr)
df1 %>%
filter(Name %in% unique(Name[year == max(year)]))
data
df1 <- structure(list(num = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 4L, 4L
), Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"),
year = c(2015L, 2016L, 2017L, 2018L, 2019L, 2018L, 2019L,
2014L, 2012L, 2013L), X = c(68L, 69L, 70L, 71L, 72L, 20L,
23L, 3L, 4L, 5L), Y = c("80%", "85%", "95%", "85%", "90%",
"80%", "75%", "55%", "75%", "100%")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Using the data frame DF shown in the Note at the end we use semi_join to reduce it to the required names, convert Y to numeric and plot it. DF is not modified.
A possible alternative to the semi_join line is
filter(ave(year == max(year), Name, FUN = any)) %>%
The code is:
library(dplyr)
library(ggplot2)
DF %>%
semi_join(filter(., year == max(year)), by = "Name") %>%
mutate(Y = as.numeric(sub("%", "", Y))) %>%
ggplot(aes(year, Y)) + geom_line() + facet_wrap(~Name)
Note
The input in reproducible form:
Lines <- " num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%"
DF <- read.table(text = Lines)
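The steps discussed in this thread (find the latest year, keep only names observed in it, convert the percentages, leave the original data untouched) can be sketched in base R as follows. The shortened data frame and the names `latest`, `active` and `plot_data` are illustrative.

```r
DF <- data.frame(
  Name = c("A", "A", "B", "B", "C", "D", "D"),
  year = c(2018L, 2019L, 2018L, 2019L, 2014L, 2012L, 2013L),
  Y = c("85%", "90%", "80%", "75%", "55%", "75%", "100%")
)

# Most recent year, computed rather than hardcoded
latest <- max(DF$year)
# Names that have at least one observation in that year
active <- unique(DF$Name[DF$year == latest])

# Work on a filtered copy so DF itself keeps every record
plot_data <- DF[DF$Name %in% active, ]
# Strip the percent sign so Y can go on a numeric axis
plot_data$Y <- as.numeric(sub("%", "", plot_data$Y, fixed = TRUE))
plot_data
```

Because the filter produces a new object, people who left and later return will reappear automatically the next time the code is run on updated data.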

Calculating a rate of change between min and max years per subgroup

I am relatively new to R and sorry if the question was already asked but I obviously either can't understand the answers or can't find the right key words!
Here is my problem : I have a dataset that looks like that:
Name Year Corg
1 Bois 17 2001 1.7
2 Bois 17 2007 2.1
3 Bois 17 2014 1.9
4 8-Toume 2000 1.7
5 8-Toume 2015 1.4
6 7-Richelien 2 2004 1.1
7 7-Richelien 2 2017 1.5
8 7-Richelien 2 2019 1.2
9 Communaux 2003 1.4
10 Communaux 2016 3.8
11 Communaux 2019 2.4
12 Cocandes 2000 1.7
13 Cocandes 2014 2.1
As you can see, I sometimes have two or three rows of results per Name (theoretically I could even have 4, 5 or more rows per Name).
For each name, I would like to calculate the annual Corg rate of change between the highest year and lowest year.
More specifically, I would like to compute:
(Corg_of_highest_year/Corg_of_lowest_year)^(1/(highest_year-lowest_year))-1
Could you explain to me how to obtain a summary dataset that looks like this:
Name Length_in_years Corg_rate
Bois 17 13 0.9%
8-Toume 15 -1.3%
etc.
We can do the calculation using group_by in dplyr
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Length = diff(range(Year)),
Corg_rate = ((Corg[which.max(Year)]/Corg[which.min(Year)]) ^
(1/Length) - 1) * 100)
# A tibble: 5 x 3
# Name Length Corg_rate
# <fct> <int> <dbl>
#1 7-Richelien2 15 0.582
#2 8-Toume 15 -1.29
#3 Bois17 13 0.859
#4 Cocandes 14 1.52
#5 Communaux 16 3.43
To perform the analysis with most recent year and the year with minimum 5 years of difference
df %>%
group_by(Name) %>%
summarise(Length = max(Year) - max(Year[Year <= max(Year) - 5]),
Corg_rate = (Corg[which.max(Year)]/Corg[Year == max(Year[Year <= (max(Year) - 5)])]) ^ (1/Length) - 1,
Corg_rate = Corg_rate * 100)
# Name Length Corg_rate
# <fct> <int> <dbl>
#1 7-Richelien2 15 0.582
#2 8-Toume 15 -1.29
#3 Bois17 7 -1.42
#4 Cocandes 14 1.52
#5 Communaux 16 3.43
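The "at least 5 years before the most recent year" rule can be easier to read as a small helper. This is a base R sketch under the same rule as the code above; the function name `rate_min_gap` and its `min_gap` argument are made up for illustration.

```r
# Annual rate of change between the latest year and the most recent
# earlier year that lies at least `min_gap` years before it
rate_min_gap <- function(year, value, min_gap = 5) {
  last <- which.max(year)
  # Candidate start years: far enough back from the latest year
  eligible <- which(year <= year[last] - min_gap)
  # Among those, take the most recent one
  start <- eligible[which.max(year[eligible])]
  n_years <- year[last] - year[start]
  (value[last] / value[start])^(1 / n_years) - 1
}

# Bois17: latest year is 2014, so the start year becomes 2007 (gap of 7)
r <- rate_min_gap(c(2001, 2007, 2014), c(1.7, 2.1, 1.9))
round(100 * r, 2)
```

For Bois17 this should give about -1.42 percent, in line with the table above. If no year is far enough back, the helper returns a zero-length result, which would need separate handling.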
data
df <- structure(list(Name = structure(c(3L, 3L, 3L, 2L, 2L, 1L, 1L,
1L, 5L, 5L, 5L, 4L, 4L), .Label = c("7-Richelien2", "8-Toume",
"Bois17", "Cocandes", "Communaux"), class = "factor"), Year = c(2001L,
2007L, 2014L, 2000L, 2015L, 2004L, 2017L, 2019L, 2003L, 2016L,
2019L, 2000L, 2014L), Corg = c(1.7, 2.1, 1.9, 1.7, 1.4, 1.1,
1.5, 1.2, 1.4, 3.8, 2.4, 1.7, 2.1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))
First create an indicator of whether the year is the max or min within each Name, then spread the Corg column into MAX_corg (Corg of the max year) and MIN_corg; the rate of change can then be calculated easily. The span in years is stored as well, because the exponent needs years, not Corg values.
library(dplyr)
library(tidyr)
my_df %>%
group_by(Name) %>%
mutate(
n_years = max(Year) - min(Year), # span between first and last year
year_max_min = ifelse(Year == max(Year), "MAX_corg", # new column denoting the max and min
ifelse(Year == min(Year), "MIN_corg", NA))
) %>%
filter(!is.na(year_max_min)) %>% # removing the in-between years
ungroup() %>%
select(Name, n_years, year_max_min, Corg) %>% # keep only what spread() needs
spread(year_max_min, Corg) %>% # spread the indicator into two columns: MAX_corg and MIN_corg
mutate(
rate_of_change = (MAX_corg / MIN_corg)^(1 / n_years) - 1 # annual rate of change
)
Use dplyr's group_by(Name) and then calculate your values. Here is an example:
library(dplyr)
data %>%
group_by(Name) %>%
summarise(Length = max(Year) - min(Year),
Corg_End = sum(Corg[Year == max(Year)]),
Corg_Start = sum(Corg[Year == min(Year)]))
This shows the logic of grouping: after group_by(Name), max(Year) returns the highest year per name instead of the overall maximum. With this logic, calculating the rate of change should be straightforward, but I have not attempted it for lack of reproducible data.
Here is a solution using data.table:
library(data.table)
df = data.table(df)
mat = df[, .(
Rate = 100*((Corg[which.max(Year)] / Corg[which.min(Year)])^(1/diff(range(Year))) - 1)
), by = Name]
> mat
Name Rate
1: Bois17 0.8592524
2: 8-Toume -1.2860324
3: 7-Richelien2 0.5817615
4: Communaux 3.4261123
5: Cocandes 1.5207989
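As a worked check of the formula used throughout this thread, the compound annual rate can be computed by hand for a single site. This is a small base R sketch; the function name `annual_rate` is illustrative.

```r
# Compound annual rate of change between the earliest and latest year
annual_rate <- function(year, value) {
  first <- which.min(year)
  last <- which.max(year)
  (value[last] / value[first])^(1 / (year[last] - year[first])) - 1
}

# Bois17: 1.7 in 2001 and 1.9 in 2014, i.e. (1.9 / 1.7)^(1 / 13) - 1
bois <- annual_rate(c(2001, 2007, 2014), c(1.7, 2.1, 1.9))
# 8-Toume: 1.7 in 2000 and 1.4 in 2015
toume <- annual_rate(c(2000, 2015), c(1.7, 1.4))

round(100 * c(Bois17 = bois, `8-Toume` = toume), 2)
```

The values should agree with the dplyr and data.table results above (roughly 0.86 and -1.29 percent per year). The middle observation for Bois17 plays no role, since only the first and last years enter the formula.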

In old data frame, order by two columns and store first of each row into new data frame

I have a data frame that contains 3 columns and I'd like to use the columns date and location to obtain the most recent observation of each location and store it in a new data frame.
> old.data
date location amount
2014 NY 1
2015 NJ 2
2016 NY 3
2015 NM 4
2013 NY 5
2014 NJ 6
2016 NM 7
2016 NJ 8
2015 NY 9
> new.data
date location amount
2016 NJ 8
2016 NM 7
2016 NY 3
Using dplyr:
library(dplyr)
new.data <- old.data %>% arrange(desc(date), location) %>% group_by(location) %>% slice(1)
new.data
Source: local data frame [3 x 2]
Groups: location [3]
date location
<int> <fctr>
1 2016 NJ
2 2016 NM
3 2016 NY
Using data.table:
library(data.table)
# Code updated by Arun
setDT(old.data)[order(-date, location), .(date = date[1L]), by = location]
location date
1: NJ 2016
2: NM 2016
3: NY 2016
Data
old.data <- structure(list(date = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L,
2016L, 2016L, 2015L), location = structure(c(3L, 1L, 3L, 2L,
3L, 1L, 2L, 1L, 3L), .Label = c("NJ", "NM", "NY"), class = "factor")), .Names = c("date",
"location"), class = "data.frame", row.names = c(NA, -9L))
Update (as OP changed the original dataframe)
The dplyr solution is still valid.
For data.table, this is the only way I could think of:
setDT(old.data)[order(-date, location), colnames(old.data), with = F][date == max(date)]
date location amount
1: 2016 NJ 8
2: 2016 NM 7
3: 2016 NY 3
Using .SD and .SDcols as suggested by Arun
# adding more data
old.data$amount <- 1:9
old.data$a <- 10:18
# Retain all columns
keep_cols <- colnames(old.data)[-2] # Remove the column which is mentioned in by
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = keep_cols]
# or assigning colnames to .SDcols directly:
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = (colnames(old.data)[-2])]
location date amount a
1: NJ 2016 8 17
2: NM 2016 7 16
3: NY 2016 3 12
What about this:
library(dplyr)
date <- c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015)
location <- c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY")
old.data <- data.frame(date, location)
new.data <- group_by(old.data, location)
new.data <- summarise(new.data, year = max(date))
Using the data.table package:
library(data.table)
setDT(dat)[order(-date), .SD[1L], by = location]
# location date
# 1: NY 2016
# 2: NM 2016
# 3: NJ 2016
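The same "sort, then keep the first row per group" idea also works without any packages. A minimal base R sketch, using the question's data including the amount column; the intermediate name `sorted` is illustrative.

```r
old.data <- data.frame(
  date = c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015),
  location = c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY"),
  amount = 1:9
)

# Sort by location, newest date first within each location
sorted <- old.data[order(old.data$location, -old.data$date), ]
# duplicated() flags every row after the first per location, so
# negating it keeps exactly the most recent observation
new.data <- sorted[!duplicated(sorted$location), ]
new.data
```

All columns are carried along, so the amount belonging to the most recent date is kept for each location, matching the new.data shown in the question.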
