Say we have a dataframe like the following one:
cohort month customers
Jan 01 523
Jan 02 332
Jan 03 221
Jan 04 190
Feb 02 489
Feb 03 310
Feb 04 205
Mar 03 372
Mar 04 192
Apr 04 340
My aim is to create a brand new column storing the retention rate for each cohort. To do this, I need to calculate how many customers remain in the last month (04) relative to the total number joining each cohort in its first month (e.g. Jan: 190 / 523 ≈ 0.36).
I'm struggling with dplyr to produce two tables that would look like the ones below.
One calculating the current retention rate for each cohort:
cohort rr
Jan 0.36
Feb 0.42
Mar 0.52
And, perhaps the most important one, another that could give me the RR evolution on a monthly basis, like the one below:
cohort month customers period rr
Jan 01 523 0 1
Jan 02 332 1 0.63
Jan 03 221 2 0.42
Jan 04 190 3 0.36
Feb 02 489 0 1
Feb 03 310 1 0.63
Feb 04 205 2 0.42
Mar 03 372 0 1
Mar 04 192 1 0.52
Apr 04 340 0 1
One dplyr option, starting with the monthly-evolution table, could be:
df %>%
  group_by(cohort) %>%
  mutate(period = 1:n() - 1,
         rr = customers / first(customers))
cohort month customers period rr
<chr> <int> <int> <dbl> <dbl>
1 Jan 1 523 0 1
2 Jan 2 332 1 0.635
3 Jan 3 221 2 0.423
4 Jan 4 190 3 0.363
5 Feb 2 489 0 1
6 Feb 3 310 1 0.634
7 Feb 4 205 2 0.419
8 Mar 3 372 0 1
9 Mar 4 192 1 0.516
10 Apr 4 340 0 1
And for the per-cohort retention table:
df %>%
  group_by(cohort) %>%
  summarise(rr = last(customers) / first(customers))
cohort rr
<chr> <dbl>
1 Apr 1
2 Feb 0.419
3 Jan 0.363
4 Mar 0.516
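Note the cohorts come back in alphabetical order because cohort is a character column. A small sketch (just adding a factor() step to the pipeline above) that keeps them in first-appearance order:
df %>%
  mutate(cohort = factor(cohort, levels = unique(cohort))) %>%
  group_by(cohort) %>%
  summarise(rr = last(customers) / first(customers))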
Does this work?
df %>% group_by(cohort) %>% summarise(rr = customers[n()] / customers[1])
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 2
cohort rr
<chr> <dbl>
1 Apr 1
2 Feb 0.419
3 Jan 0.363
4 Mar 0.516
For the second one, another take:
df %>% group_by(cohort) %>% mutate(period = 0:(n()-1), rr = customers/customers[1])
# A tibble: 10 x 5
# Groups: cohort [4]
cohort month customers period rr
<chr> <chr> <dbl> <int> <dbl>
1 Jan 01 523 0 1
2 Jan 02 332 1 0.635
3 Jan 03 221 2 0.423
4 Jan 04 190 3 0.363
5 Feb 02 489 0 1
6 Feb 03 310 1 0.634
7 Feb 04 205 2 0.419
8 Mar 03 372 0 1
9 Mar 04 192 1 0.516
10 Apr 04 340 0 1
I have a dataset where I have been able to loop over different test values with ppois. For simplicity's sake, I have used an average of 4 events per month, and I wanted to know the likelihood of n or more events, given the average. Here is what I have managed to make work:
MonthlyAverage <- 4
cnt <- c(0:10)
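# Note: ppois() is vectorised over cnt, so the loop below recomputes the same full vector on every pass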
for (i in cnt) {
  CountProb <- ppois(cnt, MonthlyAverage, lower.tail = FALSE)
}
dfProb <- data.frame(cnt,CountProb)
I am now interested in extending this to figure out how many events I may expect each month, given that month's mean.
I would be looking to say:
For January, what is the probability of 0
For January, what is the probability of 1
For January, what is the probability of 2
etc...
For February, what is the probability of 0
For February, what is the probability of 1
For February, what is the probability of 2
etc.
To give something like (numbers here are just an example):
I thought about using one loop to select the correct month, then removing the month column so I am left with just the single "Monthly Average" value, and then running the count loop, but that doesn't seem to work: I still get "Non-numeric argument to mathematical function". I feel like I'm close; can anyone please point me in the right direction for the formatting?
a "tidy-style" solution:
library(tidyr)
library(dplyr)
## example data:
df <- data.frame(Month = c('Jan', 'Feb'),
                 MonthlyAverage = c(5, 2))
> df
Month MonthlyAverage
1 Jan 5
2 Feb 2
df |>
  mutate(n = list(1:10)) |>
  unnest_longer(n) |>
  mutate(CountProb = ppois(n, MonthlyAverage, lower.tail = FALSE))
# A tibble: 20 x 4
Month MonthlyAverage n CountProb
<chr> <dbl> <int> <dbl>
1 Jan 5 1 0.960
2 Jan 5 2 0.875
3 Jan 5 3 0.735
4 Jan 5 4 0.560
5 Jan 5 5 0.384
6 Jan 5 6 0.238
## ...
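(Note: the pipeline above enumerates n = 1:10; since the question also asks for the probability at 0 events, you could use mutate(n = list(0:10)) instead.)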
How about something like this:
cnt <- 0:10
MonthlyAverage <- c(1.8, 1.56, 2.44, 1.86, 2.1, 2.3, 2, 2.78, 1.89, 1.86, 1.4, 1.71)
grid <- expand.grid(cnt = cnt, m_num = 1:12)
grid$MonthlyAverage <- MonthlyAverage[grid$m_num]
mnames <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
grid$month <- mnames[grid$m_num]
grid$prob <- ppois(grid$cnt, grid$MonthlyAverage, lower.tail=FALSE)
grid[,c("month", "cnt", "prob")]
#> month cnt prob
#> 1 Jan 0 8.347011e-01
#> 2 Jan 1 5.371631e-01
#> 3 Jan 2 2.693789e-01
#> 4 Jan 3 1.087084e-01
#> 5 Jan 4 3.640666e-02
#> 6 Jan 5 1.037804e-02
#> 7 Jan 6 2.569450e-03
#> 8 Jan 7 5.615272e-04
#> 9 Jan 8 1.097446e-04
#> 10 Jan 9 1.938814e-05
#> 11 Jan 10 3.123964e-06
#> 12 Feb 0 7.898639e-01
#> 13 Feb 1 4.620517e-01
#> 14 Feb 2 2.063581e-01
#> 15 Feb 3 7.339743e-02
#> 16 Feb 4 2.154277e-02
#> 17 Feb 5 5.364120e-03
#> 18 Feb 6 1.157670e-03
#> 19 Feb 7 2.202330e-04
#> 20 Feb 8 3.743272e-05
#> 21 Feb 9 5.747339e-06
#> 22 Feb 10 8.044197e-07
#> 23 Mar 0 9.128391e-01
#> 24 Mar 1 7.001667e-01
#> 25 Mar 2 4.407062e-01
#> 26 Mar 3 2.296784e-01
#> 27 Mar 4 1.009515e-01
#> 28 Mar 5 3.813271e-02
#> 29 Mar 6 1.258642e-02
#> 30 Mar 7 3.681711e-03
#> 31 Mar 8 9.657751e-04
#> 32 Mar 9 2.294546e-04
#> 33 Mar 10 4.979244e-05
#> 34 Apr 0 8.443274e-01
#> 35 Apr 1 5.547763e-01
#> 36 Apr 2 2.854938e-01
#> 37 Apr 3 1.185386e-01
#> 38 Apr 4 4.090445e-02
#> 39 Apr 5 1.202455e-02
#> 40 Apr 6 3.071778e-03
#> 41 Apr 7 6.928993e-04
#> 42 Apr 8 1.398099e-04
#> 43 Apr 9 2.550478e-05
#> 44 Apr 10 4.244028e-06
#> 45 May 0 8.775436e-01
#> 46 May 1 6.203851e-01
#> 47 May 2 3.503686e-01
#> 48 May 3 1.613572e-01
#> 49 May 4 6.212612e-02
#> 50 May 5 2.044908e-02
#> 51 May 6 5.862118e-03
#> 52 May 7 1.486029e-03
#> 53 May 8 3.373058e-04
#> 54 May 9 6.927041e-05
#> 55 May 10 1.298297e-05
#> 56 Jun 0 8.997412e-01
#> 57 Jun 1 6.691458e-01
#> 58 Jun 2 4.039612e-01
#> 59 Jun 3 2.006529e-01
#> 60 Jun 4 8.375072e-02
#> 61 Jun 5 2.997569e-02
#> 62 Jun 6 9.361934e-03
#> 63 Jun 7 2.588841e-03
#> 64 Jun 8 6.415773e-04
#> 65 Jun 9 1.439431e-04
#> 66 Jun 10 2.948727e-05
#> 67 Jul 0 8.646647e-01
#> 68 Jul 1 5.939942e-01
#> 69 Jul 2 3.233236e-01
#> 70 Jul 3 1.428765e-01
#> 71 Jul 4 5.265302e-02
#> 72 Jul 5 1.656361e-02
#> 73 Jul 6 4.533806e-03
#> 74 Jul 7 1.096719e-03
#> 75 Jul 8 2.374473e-04
#> 76 Jul 9 4.649808e-05
#> 77 Jul 10 8.308224e-06
#> 78 Aug 0 9.379615e-01
#> 79 Aug 1 7.654944e-01
#> 80 Aug 2 5.257652e-01
#> 81 Aug 3 3.036162e-01
#> 82 Aug 4 1.492226e-01
#> 83 Aug 5 6.337975e-02
#> 84 Aug 6 2.360590e-02
#> 85 Aug 7 7.809999e-03
#> 86 Aug 8 2.320924e-03
#> 87 Aug 9 6.254093e-04
#> 88 Aug 10 1.540564e-04
#> 89 Sep 0 8.489282e-01
#> 90 Sep 1 5.634025e-01
#> 91 Sep 2 2.935807e-01
#> 92 Sep 3 1.235929e-01
#> 93 Sep 4 4.327373e-02
#> 94 Sep 5 1.291307e-02
#> 95 Sep 6 3.349459e-03
#> 96 Sep 7 7.672845e-04
#> 97 Sep 8 1.572459e-04
#> 98 Sep 9 2.913775e-05
#> 99 Sep 10 4.925312e-06
#> 100 Oct 0 8.443274e-01
#> 101 Oct 1 5.547763e-01
#> 102 Oct 2 2.854938e-01
#> 103 Oct 3 1.185386e-01
#> 104 Oct 4 4.090445e-02
#> 105 Oct 5 1.202455e-02
#> 106 Oct 6 3.071778e-03
#> 107 Oct 7 6.928993e-04
#> 108 Oct 8 1.398099e-04
#> 109 Oct 9 2.550478e-05
#> 110 Oct 10 4.244028e-06
#> 111 Nov 0 7.534030e-01
#> 112 Nov 1 4.081673e-01
#> 113 Nov 2 1.665023e-01
#> 114 Nov 3 5.372525e-02
#> 115 Nov 4 1.425330e-02
#> 116 Nov 5 3.201149e-03
#> 117 Nov 6 6.223149e-04
#> 118 Nov 7 1.065480e-04
#> 119 Nov 8 1.628881e-05
#> 120 Nov 9 2.248494e-06
#> 121 Nov 10 2.828495e-07
#> 122 Dec 0 8.191342e-01
#> 123 Dec 1 5.098537e-01
#> 124 Dec 2 2.454189e-01
#> 125 Dec 3 9.469102e-02
#> 126 Dec 4 3.025486e-02
#> 127 Dec 5 8.217692e-03
#> 128 Dec 6 1.937100e-03
#> 129 Dec 7 4.028407e-04
#> 130 Dec 8 7.489285e-05
#> 131 Dec 9 1.258275e-05
#> 132 Dec 10 1.927729e-06
Created on 2023-01-09 by the reprex package (v2.0.1)
If you have each month's mean, in base R you can easily use sapply to compute, for each month's mean, the probability of exceeding each count from 0 to 10. Then you can simply combine the results in a data frame:
# Data
df <- data.frame(month = month.name,
                 mean = c(1.8, 2.8, 1.7, 1.6, 1.8, 2,
                          2.3, 2.4, 2.1, 1.4, 1.9, 1.9))
probs <- sapply(1:12, function(x) ppois(0:10, df$mean[x], lower.tail = FALSE))
finaldata <- data.frame(month = rep(month.name, each = 11),
                        events = rep(0:10, times = 12),
                        prob = as.vector(probs))
Output:
# month events prob
# 1 January 0 8.347011e-01
# 2 January 1 5.371631e-01
# 3 January 2 2.693789e-01
# 4 January 3 1.087084e-01
# 5 January 4 3.640666e-02
# 6 January 5 1.037804e-02
# 7 January 6 2.569450e-03
# 8 January 7 5.615272e-04
# 9 January 8 1.097446e-04
# 10 January 9 1.938814e-05
# 11 January 10 3.123964e-06
# 12 February 0 9.391899e-01
# 13 February 1 7.689218e-01
# 14 February 2 5.305463e-01
# 15 February 3 3.080626e-01
# ...
# 131 December 9 3.044317e-05
# 132 December 10 5.172695e-06
I'm trying to make a lag plot with gg_lag() but I keep getting the same error about my tsibble:
# A tsibble: 255 x 6 [7D]
# Key: Demand [163]
Week Demand Date Month year Quarter
<dbl> <dbl> <date> <mth> <chr> <qtr>
1 1 48 2018-01-01 2018 Jan 2018 2018 Q1
2 2 101 2018-01-08 2018 Jan 2018 2018 Q1
3 3 129 2018-01-15 2018 Jan 2018 2018 Q1
4 4 113 2018-01-22 2018 Jan 2018 2018 Q1
5 5 116 2018-01-29 2018 Jan 2018 2018 Q1
6 6 123 2018-02-05 2018 Feb 2018 2018 Q1
7 7 137 2018-02-12 2018 Feb 2018 2018 Q1
8 8 136 2018-02-19 2018 Feb 2018 2018 Q1
9 9 151 2018-02-26 2018 Feb 2018 2018 Q1
10 10 87 2018-03-05 2018 Mar 2018 2018 Q1
# ... with 245 more rows
Printer_Q %>% gg_lag(Demand, geom='point')
Error: The data provided to contains more than one time series. Please filter a single time series to use gg_lag()
I tried filtering my data with:
Printer_Q <- Demandts %>%
  select(-Week, -year, -Month, -Quarter)
...so that I am left with only Demand and Date, but it still says I have more than one time series. What am I doing wrong?
The Demand column should not be a key variable. A key variable is a categorical variable used to distinguish multiple time series in a single tsibble. It appears you just have one time series here, so you don't need a key variable.
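A minimal sketch of the fix, assuming the underlying data frame is called Demandts (as in the question) and that Date is the weekly index:
library(dplyr)
library(tsibble)
library(feasts)

# rebuild the tsibble with Date as the index and no key,
# since there is only a single series here
Printer_Q <- Demandts %>%
  as_tibble() %>%
  select(Date, Demand) %>%
  as_tsibble(index = Date)

Printer_Q %>% gg_lag(Demand, geom = "point")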
Say we have a dataframe looking like this one below:
month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89
My goal is to tweak this to get a dataframe giving me the percentage that each issue value represents of its month's total (e.g. Jan withdrawal: 250 / 435 * 100 ≈ 57.47). Plainly speaking, my desired output should look as follows:
month issue rate
Jan withdrawal 57.47
Jan delay 27.59
Jan other 14.94
Feb withdrawal 47.25
Feb delay 30.50
Feb other 22.25
I've tried helping myself with dplyr, but my attempts have all been unsuccessful so far.
library(tidyverse)
df <- read.table(text = "month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89", header = T)
df
#> month issue amount
#> 1 Jan withdrawal 250
#> 2 Jan delay 120
#> 3 Jan other 65
#> 4 Feb withdrawal 189
#> 5 Feb delay 122
#> 6 Feb other 89
df %>%
group_by(month) %>%
mutate(rate = amount / sum(amount, na.rm = T) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
df %>%
group_by(month) %>%
mutate(rate = prop.table(amount) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
Created on 2021-01-26 by the reprex package (v0.3.0)
using data.table
library(data.table)
setDT(df)
df[, rate := prop.table(amount) * 100, by = list(month)]
df
#> month issue amount rate
#> 1: Jan withdrawal 250 57.47126
#> 2: Jan delay 120 27.58621
#> 3: Jan other 65 14.94253
#> 4: Feb withdrawal 189 47.25000
#> 5: Feb delay 122 30.50000
#> 6: Feb other 89 22.25000
Created on 2021-01-26 by the reprex package (v0.3.0)
Try the answer here; it might be what you're looking for. The code is pasted below for your convenience.
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
In your case it would probably be something like:
library(dplyr)
group_by(df, month) %>% mutate(rate= amount/sum(amount))
Or alternatively:
group_by(df, month) %>% transmute(issue, rate= amount/sum(amount))
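Since the desired output is in percent, scale by 100, e.g.:
group_by(df, month) %>% transmute(issue, rate = amount/sum(amount) * 100)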
I have a very big dataframe from which I need the lossyear per Point:
# A tibble: 74,856 x 13
Date index Mean Sdev Median pixel_used doy Month Year_n Year lossyear Point Scene
<date> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <chr>
1 2013-06-11 NBR 0.481 0.0832 0.496 92647 162 6 2013 2013 2017 1 LC08_125016
2 2013-06-11 NDMI 0.175 0.0737 0.189 92647 162 6 2013 2013 2017 1 LC08_125016
3 2013-06-11 NDVI 0.734 0.0517 0.741 92647 162 6 2013 2013 2017 1 LC08_125016
4 2013-06-11 TCB 0.237 0.0159 0.235 92647 162 6 2013 2013 2017 1 LC08_125016
5 2013-06-11 TCG 0.158 0.0174 0.158 92647 162 6 2013 2013 2017 1 LC08_125016
6 2013-06-11 TCW -0.0958 0.0195 -0.0903 92647 162 6 2013 2013 2017 1 LC08_125016
7 2013-06-27 NBR 0.524 0.0503 0.525 39323 178 6 2013 2013 2017 1 LC08_125016
8 2013-06-27 NDMI 0.234 0.0464 0.236 39323 178 6 2013 2013 2017 1 LC08_125016
9 2013-06-27 NDVI 0.721 0.0351 0.725 39323 178 6 2013 2013 2017 1 LC08_125016
10 2013-06-27 TCB 0.249 0.0299 0.251 39323 178 6 2013 2013 2017 1 LC08_125016
# ... with 74,846 more rows
I was able to select just those two columns with df[, c("lossyear", "Point")]:
# A tibble: 74,856 x 2
Point lossyear
<fct> <fct>
1 1 2017
2 1 2017
3 1 2017
4 1 2017
5 1 2017
6 1 2017
7 1 2017
8 1 2017
9 1 2017
10 1 2017
# ... with 74,846 more rows
But how do I "shorten" it so that I have only one row per unique Point, with the corresponding lossyear (2000:2017)? Something like this:
# A tibble: 42 x 2
Point lossyear
<fct> <fct>
1 1 2017
2 2 2017
3 3 2017
4 4 2016
5 5 2016
6 6 2016
7 7 2015
8 8 2014
9 9 2014
10 10 2014
# ... with 32 more rows
We can use distinct to get the unique elements of the selected columns
library(dplyr)
df %>%
distinct(lossyear, Point)
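If you want Point as the first column, as in the expected output, just swap the arguments:
df %>%
  distinct(Point, lossyear)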
You could group by Point and get the first value via slice:
library(dplyr)
df %>%
  select(lossyear, Point) %>%
  group_by(Point) %>%
  slice(1) %>%
  ungroup()
I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Let's say I try to calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean of each month. How can I get R to output the standard deviation around the maximum value of each month?
You want "the max of values per month and the average deviation from this maximum value" [which is not the same as the standard deviation]:
counts <- data.frame(year = sort(rep(2000:2009, 12)),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))
library(data.table)
counts <- data.table(counts)
counts[, mean(count - max(count)), by = month]
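The computed column comes back as V1; a small variant (same computation, with an assumed column name) that labels it:
counts[, .(avg_dev_from_max = mean(count - max(count))), by = month]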
This question is somewhat vague. If you want the standard deviation of the differences from the maximum, you can use the code below. Note that max(count) is a constant within each month, so this is mathematically identical to sd(count); the values differ from the earlier table only because counts was regenerated without a seed.
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162