How to find the annual evolution rate for each firm in my data table? - r

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
|id | year | profit
|:----| :----| :----|
|1 |2001 |-0.4
|1 |2002 |-0.89
|2 |2001 |1.89
|2 |2002 |2.79
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start? Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much

From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC=(current_value​/previous_value − 1) ∗ 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58

If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag to compare the current year to the year prior

Related

Is it possible to purrr::map the function by using the elements within the same dataframe in r?

x = list(data.frame(age = c(1:4),period = c(2000:2003)),
data.frame(age = c(5:8),period = c(1998:2001)),
data.frame(age = c(11:19),period = c(1990:1998)))
map2(x, x$period, ~cbind(.x, difference = .y-.x$age))
result:
> map2(x, x$period, ~cbind(.x, difference = .y-.x$age))
list()
Is it possible to map the function by using the elements within the same dataframe?
In your context x$period is NULL since x is the list of dataframes and it has no attribute "period". I think you want to access the period column within each unnammed dataframe in the list. I would just use map which will pass along each dataframe in the list, which you can then manipulate in the function to access each column without having to explicitly pass it.
library(purrr)
library(dplyr)
x = list(data.frame(age = c(1:4),period = c(2000:2003)),
data.frame(age = c(5:8),period = c(1998:2001)),
data.frame(age = c(11:19),period = c(1990:1998)))
#Original attempt
result <- map2(x, x$period, ~cbind(.x, difference = .y-.x$age))
result
#> list()
#My solution
result2 <- map(x, function(df) cbind(df, difference = df$period - df$age))
result2
#> [[1]]
#> age period difference
#> 1 1 2000 1999
#> 2 2 2001 1999
#> 3 3 2002 1999
#> 4 4 2003 1999
#>
#> [[2]]
#> age period difference
#> 1 5 1998 1993
#> 2 6 1999 1993
#> 3 7 2000 1993
#> 4 8 2001 1993
#>
#> [[3]]
#> age period difference
#> 1 11 1990 1979
#> 2 12 1991 1979
#> 3 13 1992 1979
#> 4 14 1993 1979
#> 5 15 1994 1979
#> 6 16 1995 1979
#> 7 17 1996 1979
#> 8 18 1997 1979
#> 9 19 1998 1979
#A more readable solution using dplyr
result3 <- map(x, function(df) df %>% mutate(difference = period - age))
result3
#> [[1]]
#> age period difference
#> 1 1 2000 1999
#> 2 2 2001 1999
#> 3 3 2002 1999
#> 4 4 2003 1999
#>
#> [[2]]
#> age period difference
#> 1 5 1998 1993
#> 2 6 1999 1993
#> 3 7 2000 1993
#> 4 8 2001 1993
#>
#> [[3]]
#> age period difference
#> 1 11 1990 1979
#> 2 12 1991 1979
#> 3 13 1992 1979
#> 4 14 1993 1979
#> 5 15 1994 1979
#> 6 16 1995 1979
#> 7 17 1996 1979
#> 8 18 1997 1979
#> 9 19 1998 1979
Created on 2023-02-02 with reprex v2.0.2

How to take the mean of two subsequent rows iteratively thereby reducing the number of rows?

I have a tibble like so:
library(dplyr)
set.seed(1)
my_tib <- tibble(identifier = rep(letters[1:3], each = 4),
year = rep(seq(2005, 2020, 5), 3),
value = rnorm(12, mean = 1000, 100) %>% round()
)
my_tib
# A tibble: 12 × 3
identifier year value
<chr> <dbl> <dbl>
1 a 2005 937
2 a 2010 1018
3 a 2015 916
4 a 2020 1160
5 b 2005 1033
6 b 2010 918
7 b 2015 1049
8 b 2020 1074
9 c 2005 1058
10 c 2010 969
11 c 2015 1151
12 c 2020 1039
Now I'd like to shrink down my tibble by taking the mean value for two years each, creating a new column for the year bracket. For example, I'd like to take the mean of 937 and 1018 (977.5) for the new year_bracket 2005-2010.
I'd like to repeat this for all years and all identifiers.
So the first new 5 rows of my tibble look like this:
head(my_new_tib, 5)
# A tibble: 9 × 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005-2010 977.5
2 a 2010-2015 967
3 a 2015-2020 1038
4 b 2005-2010 975.5
5 b 2010-2015 983.5
Ideally, I'm looking for a piped dplyr solution but I'm also curious regarding other solutions.
Using dplyr:
library(dplyr)
my_tib |>
group_by(identifier) |>
mutate(value = (value + lag(value))/2,
year_bracket = paste0(lag(year)," - ",year),
.keep = "unused",
.before = 2) |>
filter(!is.na(value)) |>
ungroup()
Output:
# A tibble: 9 x 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005 - 2010 978.
2 a 2010 - 2015 967
3 a 2015 - 2020 1038
4 b 2005 - 2010 976.
5 b 2010 - 2015 984.
6 b 2015 - 2020 1062.
7 c 2005 - 2010 1014.
8 c 2010 - 2015 1060
9 c 2015 - 2020 1095
Another possible solution:
library(tidyverse)
my_tib %>%
group_by(identifier) %>%
slice(c(1, rep(2:(n()-1), each = 2) , n())) %>%
group_by(identifier, aux = rep(1:n(), each=2, length.out = n())) %>%
summarise(year_bracket = str_c(year, collapse = "_"), value = mean(value),
.groups = "drop") %>% select(-aux)
#> # A tibble: 9 × 3
#> identifier year_bracket value
#> <chr> <chr> <dbl>
#> 1 a 2005_2010 978.
#> 2 a 2010_2015 967
#> 3 a 2015_2020 1038
#> 4 b 2005_2010 976.
#> 5 b 2010_2015 984.
#> 6 b 2015_2020 1062.
#> 7 c 2005_2010 1014.
#> 8 c 2010_2015 1060
#> 9 c 2015_2020 1095

Sum up with the next line into a new colum

I'm having some trouble on figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3 , the third line is the sum 1+2+3+4 and so on.
As 1, 2, 3, 4... are hipoteticall values, I need to measure the absolute growth from a decade to another in order to create later on a new variable to measure the percentage change from a decade to another.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df,
newCol = c(cumsum(Values)[-1],sum(Values))
)

R - calculate annual population conditional on survival in every year

I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
The person is alive in year x if death_year > x & birth year <= x
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female']),
'male' = number_alive(1950:1980, df[df$gender == 'male']))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
This approach uses an ifelse to determine if alive (1) or dead (0).
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)

dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?
As an example, consider data recorded at different times at multiple sites. I might filter and then calculate something for each year in each site. I'd like to have the default value of the summary on an empty vector if the filter removes a year completely. So site "a" has 10 years and site "b" has 1 year so I'd always like 11 rows in the summary.
If I use .drop = TRUE in group_by I lose years:
library(dplyr)
library(zoo)
library(lubridate)
set.seed(1)
df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = TRUE) %>%
summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
#> 12 b 2001 NA
#> 13 b 2002 NA
#> 14 b 2003 NA
#> 15 b 2004 NA
#> 16 b 2005 NA
#> 17 b 2006 NA
#> 18 b 2007 NA
#> 19 b 2008 NA
#> 20 b 2009 NA
The best way I could think of was to calculate counts, then merge then filter then drop the count variable, but that's pretty messy.
I know the .drop was only recently added to dplyr, which is very useful for one factor, but is there yet a clean way to do this for multiple factors?
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date)) %>%
left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
filter(n > 0) %>%
select(-n)
#> # A tibble: 11 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
Not sure if this is what you like.
If you replace dates with value < 65 with NA instead of filtering them out you can proceed as usual.
df %>%
mutate(date = replace(date, value < 65, NA)) %>%
group_by(site, year) %>%
summarise(f = first(date[!is.na(date)]))
# A tibble: 11 x 3
# Groups: site [2]
site year f
<fct> <fct> <date>
1 a 2000 NA
2 a 2001 NA
3 a 2002 2002-03-01
4 a 2003 NA
5 a 2004 NA
6 a 2005 NA
7 a 2006 2006-02-01
8 a 2007 NA
9 a 2008 2008-07-01
10 a 2009 2009-02-01
11 b 2000 2000-08-01

Resources