dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data - r

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?
As an example, consider data recorded at different times at multiple sites. I might filter and then calculate something for each year in each site. I'd like to have the default value of the summary on an empty vector if the filter removes a year completely. So site "a" has 10 years and site "b" has 1 year so I'd always like 11 rows in the summary.
If I use .drop = TRUE in group_by I lose years:
library(dplyr)
library(zoo)
library(lubridate)
set.seed(1)
df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = TRUE) %>%
summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
#> 12 b 2001 NA
#> 13 b 2002 NA
#> 14 b 2003 NA
#> 15 b 2004 NA
#> 16 b 2005 NA
#> 17 b 2006 NA
#> 18 b 2007 NA
#> 19 b 2008 NA
#> 20 b 2009 NA
The best way I could think of was to calculate counts, then merge then filter then drop the count variable, but that's pretty messy.
I know the .drop was only recently added to dplyr, which is very useful for one factor, but is there yet a clean way to do this for multiple factors?
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date)) %>%
left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
filter(n > 0) %>%
select(-n)
#> # A tibble: 11 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA

Not sure if this is what you like.
If you replace dates with value < 65 with NA instead of filtering them out you can proceed as usual.
df %>%
mutate(date = replace(date, value < 65, NA)) %>%
group_by(site, year) %>%
summarise(f = first(date[!is.na(date)]))
# A tibble: 11 x 3
# Groups: site [2]
site year f
<fct> <fct> <date>
1 a 2000 NA
2 a 2001 NA
3 a 2002 2002-03-01
4 a 2003 NA
5 a 2004 NA
6 a 2005 NA
7 a 2006 2006-02-01
8 a 2007 NA
9 a 2008 2008-07-01
10 a 2009 2009-02-01
11 b 2000 2000-08-01

Related

how to find the growth rate of applicants per year

I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided is from 2020-2022. How would I go about that? I tried subsetting the data but I'm stuck on how to approach it. So essentially, I want to put the respective applicants to its corresponding year and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year to split your year-month-day variable into years and then dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq(1:100)
date <- as.Date(sample( as.numeric(as.Date('2017-01-01') ): as.numeric(as.Date('2023-01-01') ), 100,
replace = T),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23

How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
|id | year | profit
|:----| :----| :----|
|1 |2001 |-0.4
|1 |2002 |-0.89
|2 |2001 |1.89
|2 |2002 |2.79
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start? Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC=(current_value​/previous_value − 1) ∗ 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag to compare the current year to the year prior

How to take the mean of two subsequent rows iteratively thereby reducing the number of rows?

I have a tibble like so:
library(dplyr)
set.seed(1)
my_tib <- tibble(identifier = rep(letters[1:3], each = 4),
year = rep(seq(2005, 2020, 5), 3),
value = rnorm(12, mean = 1000, 100) %>% round()
)
my_tib
# A tibble: 12 × 3
identifier year value
<chr> <dbl> <dbl>
1 a 2005 937
2 a 2010 1018
3 a 2015 916
4 a 2020 1160
5 b 2005 1033
6 b 2010 918
7 b 2015 1049
8 b 2020 1074
9 c 2005 1058
10 c 2010 969
11 c 2015 1151
12 c 2020 1039
Now I'd like to shrink down my tibble by taking the mean value for two years each, creating a new column for the year bracket. For example, I'd like to take the mean of 937 and 1018 (977.5) for the new year_bracket 2005-2010.
I'd like to repeat this for all years and all identifiers.
So the first new 5 rows of my tibble look like this:
head(my_new_tib, 5)
# A tibble: 9 × 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005-2010 977.5
2 a 2010-2015 967
3 a 2015-2020 1038
4 b 2005-2010 975.5
5 b 2010-2015 983.5
Ideally, I'm looking for a piped dplyr solution but I'm also curious regarding other solutions.
Using dplyr:
library(dplyr)
my_tib |>
group_by(identifier) |>
mutate(value = (value + lag(value))/2,
year_bracket = paste0(lag(year)," - ",year),
.keep = "unused",
.before = 2) |>
filter(!is.na(value)) |>
ungroup()
Output:
# A tibble: 9 x 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005 - 2010 978.
2 a 2010 - 2015 967
3 a 2015 - 2020 1038
4 b 2005 - 2010 976.
5 b 2010 - 2015 984.
6 b 2015 - 2020 1062.
7 c 2005 - 2010 1014.
8 c 2010 - 2015 1060
9 c 2015 - 2020 1095
Another possible solution:
library(tidyverse)
my_tib %>%
group_by(identifier) %>%
slice(c(1, rep(2:(n()-1), each = 2) , n())) %>%
group_by(identifier, aux = rep(1:n(), each=2, length.out = n())) %>%
summarise(year_bracket = str_c(year, collapse = "_"), value = mean(value),
.groups = "drop") %>% select(-aux)
#> # A tibble: 9 × 3
#> identifier year_bracket value
#> <chr> <chr> <dbl>
#> 1 a 2005_2010 978.
#> 2 a 2010_2015 967
#> 3 a 2015_2020 1038
#> 4 b 2005_2010 976.
#> 5 b 2010_2015 984.
#> 6 b 2015_2020 1062.
#> 7 c 2005_2010 1014.
#> 8 c 2010_2015 1060
#> 9 c 2015_2020 1095

Sum up with the next line into a new colum

I'm having some trouble on figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3 , the third line is the sum 1+2+3+4 and so on.
As 1, 2, 3, 4... are hipoteticall values, I need to measure the absolute growth from a decade to another in order to create later on a new variable to measure the percentage change from a decade to another.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df,
newCol = c(cumsum(Values)[-1],sum(Values))
)

R How to lag a dataframe by groups

I have the following data set:
Name Year VarA VarB Data.1 Data.2
A 2016 L H 100 101
A 2017 L H 105 99
A 2018 L H 103 105
A 2016 L A 90 95
A 2017 L A 99 92
A 2018 L A 102 101
I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 NA NA
A 2017 L A 99 92 90 NA
A 2018 L A 102 101 99 90
I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)
And am using the following code:
df <- df %>%
group_by(Name) %>%
arrange(Name, VarA, VarB, Year) %>%
do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))
However, the lag offsetting all data associated w/ name, instead of the grouping I want, so only the 2018 years are accurately lagged.
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 103 105
A 2017 L A 99 92 90 103
A 2018 L A 102 101 99 90
How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?
dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.
library(dplyr)
df %>%
group_by(Name, VarA, VarB) %>%
mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2016 L H 100 101 NA NA
#> 2 A 2017 L H 105 99 100 NA
#> 3 A 2018 L H 103 105 105 100
#> 4 A 2016 L A 90 95 NA NA
#> 5 A 2017 L A 99 92 90 NA
#> 6 A 2018 L A 102 101 99 90
If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate of a set of n to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.
I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.
library(dplyr)
library(purrr)
set.seed(127)
df <- tibble(
Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T)
)
df_list <- purrr::map(1:4, function(i) {
df %>%
group_by(Name, VarA, VarB) %>%
mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})
You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.
df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lag3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA
#> 2 A 2017 L H 1 4 NA
#> 3 A 2018 L H 3 8 NA
#> 4 A 2019 L H 2 2 3
#> 5 A 2020 L H 4 5 1
#> 6 A 2016 L A 8 4 NA
#> 7 A 2017 L A 6 8 NA
#> 8 A 2018 L A 3 2 NA
#> 9 A 2019 L A 8 6 8
#> 10 A 2020 L A 9 1 6
Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.
reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups: Name, VarA, VarB [?]
#> Name Year VarA VarB Data.1 Data.2 Lag1 Lag2 Lag3 Lag4
#> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA NA NA NA
#> 2 A 2017 L H 1 4 3 NA NA NA
#> 3 A 2018 L H 3 8 1 3 NA NA
#> 4 A 2019 L H 2 2 3 1 3 NA
#> 5 A 2020 L H 4 5 2 3 1 3
#> 6 A 2016 L A 8 4 NA NA NA NA
#> 7 A 2017 L A 6 8 8 NA NA NA
#> 8 A 2018 L A 3 2 6 8 NA NA
#> 9 A 2019 L A 8 6 3 6 8 NA
#> 10 A 2020 L A 9 1 8 3 6 8
Created on 2018-12-07 by the reprex package (v0.2.1)

Resources