R loop over nominal list and integers
I have a dataset where I have been able to loop over different test values with dpois/ppois. For simplicity's sake, I have used an average of 4 events per month, and I want to know the likelihood of n or more events given that average. Here is what I have managed to make work:
MonthlyAverage <- 4
cnt <- 0:10

# ppois() is vectorised over cnt, so every pass of this loop assigns
# the same full vector of upper-tail probabilities
for (i in cnt) {
  CountProb <- ppois(cnt, MonthlyAverage, lower.tail = FALSE)
}

dfProb <- data.frame(cnt, CountProb)
I now want to extend this to work out how many events I can expect in each month, given that month's mean.
I would be looking to say:
For January, what is the probability of 0
For January, what is the probability of 1
For January, what is the probability of 2
etc...
For February, what is the probability of 0
For February, what is the probability of 1
For February, what is the probability of 2
etc.
To give a table of month, count, and probability for every combination (the numbers in my original example were just illustrative).
I thought about using one loop to select the correct month, dropping the month column so I am left with just the single "Monthly Average" value, and then running the count loop, but that doesn't work: I still get "Non-numeric argument to mathematical function". I feel like I'm close; can anyone point me in the right direction?
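For reference, that error usually means ppois() was handed something non-numeric, such as a whole data frame or a character month column, rather than a numeric vector. A minimal sketch of the likely failure and the fix, using a hypothetical two-month data frame (the name "months" and its values are made up for illustration):
months <- data.frame(Month = c("Jan", "Feb"), MonthlyAverage = c(5, 2))

# ppois(0:10, months)  # non-numeric second argument: the error reported above
# Subsetting down to the single numeric mean for one month works:
ppois(0:10, months$MonthlyAverage[months$Month == "Jan"], lower.tail = FALSE)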
A "tidy-style" solution:
library(tidyr)
library(dplyr)
## example data:
df <- data.frame(Month = c('Jan', 'Feb'),
                 MonthlyAverage = c(5, 2))
> df
Month MonthlyAverage
1 Jan 5
2 Feb 2
df |>
  mutate(n = list(1:10)) |>
  unnest_longer(n) |>
  mutate(CountProb = ppois(n, MonthlyAverage, lower.tail = FALSE))
# A tibble: 20 x 4
Month MonthlyAverage n CountProb
<chr> <dbl> <int> <dbl>
1 Jan 5 1 0.960
2 Jan 5 2 0.875
3 Jan 5 3 0.735
4 Jan 5 4 0.560
5 Jan 5 5 0.384
6 Jan 5 6 0.238
## ...
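If you also want the probability of zero events (as in the original cnt <- 0:10), and strictly "n or more" rather than "more than n", a small tweak to the same pipeline should do it: ppois(q, lambda, lower.tail = FALSE) returns P(X > q), so evaluating it at n - 1 gives P(X >= n). A sketch reusing the df above:
df |>
  mutate(n = list(0:10)) |>
  unnest_longer(n) |>
  # P(X >= n): evaluate the Poisson upper tail at n - 1
  mutate(CountProb = ppois(n - 1, MonthlyAverage, lower.tail = FALSE))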
How about something like this:
cnt <- 0:10
MonthlyAverage <- c(1.8, 1.56, 2.44, 1.86, 2.1, 2.3, 2, 2.78, 1.89, 1.86, 1.4, 1.71)

# every combination of count (0-10) and month number (1-12)
grid <- expand.grid(cnt = cnt, m_num = 1:12)
grid$MonthlyAverage <- MonthlyAverage[grid$m_num]
mnames <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
grid$month <- mnames[grid$m_num]

# ppois() is vectorised, so a single call fills the whole grid
grid$prob <- ppois(grid$cnt, grid$MonthlyAverage, lower.tail = FALSE)
grid[, c("month", "cnt", "prob")]
#> month cnt prob
#> 1 Jan 0 8.347011e-01
#> 2 Jan 1 5.371631e-01
#> 3 Jan 2 2.693789e-01
#> 4 Jan 3 1.087084e-01
#> 5 Jan 4 3.640666e-02
#> 6 Jan 5 1.037804e-02
#> 7 Jan 6 2.569450e-03
#> 8 Jan 7 5.615272e-04
#> 9 Jan 8 1.097446e-04
#> 10 Jan 9 1.938814e-05
#> 11 Jan 10 3.123964e-06
#> 12 Feb 0 7.898639e-01
#> 13 Feb 1 4.620517e-01
#> 14 Feb 2 2.063581e-01
#> 15 Feb 3 7.339743e-02
#> 16 Feb 4 2.154277e-02
#> 17 Feb 5 5.364120e-03
#> 18 Feb 6 1.157670e-03
#> 19 Feb 7 2.202330e-04
#> 20 Feb 8 3.743272e-05
#> 21 Feb 9 5.747339e-06
#> 22 Feb 10 8.044197e-07
#> ...  (rows 23-132, for Mar through Dec, follow the same pattern and are omitted here)
Created on 2023-01-09 by the reprex package (v2.0.1)
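The same idea reads a little more compactly if the monthly means live in a named vector; a sketch using the same placeholder averages as above:
cnt <- 0:10
avg <- c(Jan = 1.8, Feb = 1.56, Mar = 2.44, Apr = 1.86, May = 2.1, Jun = 2.3,
         Jul = 2, Aug = 2.78, Sep = 1.89, Oct = 1.86, Nov = 1.4, Dec = 1.71)

# one row per month/count pair; the month names index the vector of means
grid <- expand.grid(cnt = cnt, month = names(avg))
grid$prob <- ppois(grid$cnt, avg[as.character(grid$month)], lower.tail = FALSE)
head(grid)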
If you have each month's mean, in base R you can use sapply to compute the probability of observing 0 to 10 events from each month's mean, and then combine the results into a data frame:
# Data
df <- data.frame(month = month.name,
                 mean = c(1.8, 2.8, 1.7, 1.6, 1.8, 2,
                          2.3, 2.4, 2.1, 1.4, 1.9, 1.9))

# one column of upper-tail probabilities (0 to 10 events) per month
probs <- sapply(1:12, function(x) ppois(0:10, df$mean[x], lower.tail = FALSE))

finaldata <- data.frame(month = rep(month.name, each = 11),
                        events = rep(0:10, times = 12),
                        prob = as.vector(probs))
Output:
# month events prob
# 1 January 0 8.347011e-01
# 2 January 1 5.371631e-01
# 3 January 2 2.693789e-01
# 4 January 3 1.087084e-01
# 5 January 4 3.640666e-02
# 6 January 5 1.037804e-02
# 7 January 6 2.569450e-03
# 8 January 7 5.615272e-04
# 9 January 8 1.097446e-04
# 10 January 9 1.938814e-05
# 11 January 10 3.123964e-06
# 12 February 0 9.391899e-01
# 13 February 1 7.689218e-01
# 14 February 2 5.305463e-01
# 15 February 3 3.080626e-01
# ...
# 131 December 9 3.044317e-05
# 132 December 10 5.172695e-06
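As a quick sanity check on any single entry, the upper tail can be compared against dpois(): with January's mean of 1.8, the probability of more than 0 events is 1 minus the probability of exactly 0.
# P(X > 0) = 1 - P(X = 0) = 1 - exp(-1.8)
1 - dpois(0, 1.8)                   # 0.8347011
ppois(0, 1.8, lower.tail = FALSE)   # same value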