Consider the following dataset:
Company name  Year  Customers
Company A     2018  100
Company B     2018  120
Company C     2018  150
Company A     2019  120
Company B     2019  180
Company C     2019  80
Company A     2020  200
Company B     2020  500
Company C     2020  140
I want to measure the future return in customers, so I need each company's customer count for the following year in a new column. Something like this:
Company name  Year  Customers  Customers next year
Company A     2018  100        120
Company B     2018  120        180
Company C     2018  150        80
Company A     2019  120        200
Company B     2019  180        500
Company C     2019  80         140
Company A     2020  200        NA
Company B     2020  500        NA
Company C     2020  140        NA
Does anybody have any idea how to do this?
You can use lead like in the following code:
library(dplyr)
df %>%
  group_by(Company) %>%
  mutate(customers_next_year = lead(Customers)) %>%
  ungroup()
Output:
# A tibble: 9 × 4
Company Year Customers customers_next_year
<chr> <dbl> <dbl> <dbl>
1 A 2018 100 120
2 B 2018 120 180
3 C 2018 150 80
4 A 2019 120 200
5 B 2019 180 500
6 C 2019 80 140
7 A 2020 200 NA
8 B 2020 500 NA
9 C 2020 140 NA
Data
df <- data.frame(Company = rep(c("A", "B", "C"), 3),
                 Year = c(2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020),
                 Customers = c(100, 120, 150, 120, 180, 80, 200, 500, 140))
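Note that lead() takes the next row within each group, so the result above depends on the rows already being sorted by Year. As a minimal sketch of guarding against that, lead() accepts an order_by argument. The shuffled row order below is made up purely for illustration; it is not part of the original data.

```r
library(dplyr)

df <- data.frame(Company = rep(c("A", "B", "C"), 3),
                 Year = rep(2018:2020, each = 3),
                 Customers = c(100, 120, 150, 120, 180, 80, 200, 500, 140))

# Hypothetical scenario: the rows arrive in arbitrary order
shuffled <- df[c(9, 1, 5, 3, 7, 2, 8, 4, 6), ]

res <- shuffled %>%
  group_by(Company) %>%
  # order_by makes lead() follow Year rather than the physical row order
  mutate(customers_next_year = lead(Customers, order_by = Year)) %>%
  ungroup() %>%
  arrange(Year, Company)
```

Sorting first with arrange(Company, Year) and then calling lead(Customers) should give the same values; order_by just keeps the original row order intact.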
A method without external packages:
within(df, {
customers_next_year <- ave(Customers, Company, FUN = \(x) c(x[-1], NA))
})
# Company Year Customers customers_next_year
# 1 A 2018 100 120
# 2 B 2018 120 180
# 3 C 2018 150 80
# 4 A 2019 120 200
# 5 B 2019 180 500
# 6 C 2019 80 140
# 7 A 2020 200 NA
# 8 B 2020 500 NA
# 9 C 2020 140 NA
Related
Consider this data
data <- data.frame(
group = c(rep("A", 10), rep("B", 10)),
year = rep(2016:2025, 2),
value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)
data
#> group year value
#> 1 A 2016 10
#> 2 A 2017 20
#> 3 A 2018 30
#> 4 A 2019 40
#> 5 A 2020 NA
#> 6 A 2021 NA
#> 7 A 2022 NA
#> 8 A 2023 NA
#> 9 A 2024 NA
#> 10 A 2025 NA
#> 11 B 2016 70
#> 12 B 2017 80
#> 13 B 2018 NA
#> 14 B 2019 NA
#> 15 B 2020 NA
#> 16 B 2021 NA
#> 17 B 2022 NA
#> 18 B 2023 NA
#> 19 B 2024 NA
#> 20 B 2025 NA
Now I'd like to fill in the missing values by applying an arbitrary function
to the last non-missing value, recursively.
For example, let's say that I want
the values to increase by 10 points each year.
Rows where value is not NA should remain unmodified.
Starting from the first row where value is NA, the rule value_t = value_{t-1} + 10 applies.
A naive attempt would be to use dplyr::lag,
but this only works for the first missing value, because lag is vectorized:
it operates on the original value vector and does not recurse over the previously filled values.
library(dplyr)
data |>
group_by(group) |>
mutate(value_fix = dplyr::lag(value) + 10)
#> # A tibble: 20 × 4
#> # Groups: group [2]
#> group year value value_fix
#> <chr> <int> <dbl> <dbl>
#> 1 A 2016 10 NA
#> 2 A 2017 20 20
#> 3 A 2018 30 30
#> 4 A 2019 40 40
#> 5 A 2020 NA 50
#> 6 A 2021 NA NA
#> 7 A 2022 NA NA
#> 8 A 2023 NA NA
#> 9 A 2024 NA NA
#> 10 A 2025 NA NA
#> 11 B 2016 70 NA
#> 12 B 2017 80 80
#> 13 B 2018 NA 90
#> 14 B 2019 NA NA
#> 15 B 2020 NA NA
#> 16 B 2021 NA NA
#> 17 B 2022 NA NA
#> 18 B 2023 NA NA
#> 19 B 2024 NA NA
#> 20 B 2025 NA NA
This is where I thought tidyr could help, because it is somewhat similar
to tidyr::fill
data |>
group_by(group) |>
tidyr::fill(value)
but ideally, with an .f argument to be applied recursively to the last value.
There does not seem to be something like that.
Googling around I came up with this solution
data |>
group_by(group) |>
mutate(last_value = case_when(
value == dplyr::last(na.omit(value)) ~ value,
TRUE ~ NA_real_
)) |>
mutate(value_fix = purrr::accumulate(
.x = last_value,
.f = ~ coalesce(.x + 10, .y)
))
#> # A tibble: 20 × 5
#> # Groups: group [2]
#> group year value last_value value_fix
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 2016 10 NA NA
#> 2 A 2017 20 NA NA
#> 3 A 2018 30 NA NA
#> 4 A 2019 40 40 40
#> 5 A 2020 NA NA 50
#> 6 A 2021 NA NA 60
#> 7 A 2022 NA NA 70
#> 8 A 2023 NA NA 80
#> 9 A 2024 NA NA 90
#> 10 A 2025 NA NA 100
#> 11 B 2016 70 NA NA
#> 12 B 2017 80 80 80
#> 13 B 2018 NA NA 90
#> 14 B 2019 NA NA 100
#> 15 B 2020 NA NA 110
#> 16 B 2021 NA NA 120
#> 17 B 2022 NA NA 130
#> 18 B 2023 NA NA 140
#> 19 B 2024 NA NA 150
#> 20 B 2025 NA NA 160
This works, but it seems hacky and is not easy to read. It would be cleaner to just write the loop and be happy.
I really thought that for such a simple case there would be a built-in way (vectorized, readable code) to do it in the tidyverse, but I could not find one. Am I missing something? Any ideas on how to do this better?
Created on 2022-08-30 with reprex v2.0.2
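For comparison, the plain loop mentioned above could look roughly like the sketch below. fill_recursive is a hypothetical helper name, and the rule to apply is passed in as a function so that it is not tied to "+ 10".

```r
data <- data.frame(
  group = c(rep("A", 10), rep("B", 10)),
  year = rep(2016:2025, 2),
  value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
            70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)

# Hypothetical helper: replace each NA with f(previous value), left to right,
# so that already-filled values feed into the next step
fill_recursive <- function(x, f) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- f(x[i - 1])
  }
  x
}

# ave() applies the helper separately within each group
data$value_fix <- ave(data$value, data$group,
                      FUN = function(v) fill_recursive(v, function(prev) prev + 10))
```

A leading NA in a group stays NA, since there is no previous value to build on.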
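library(dplyr)
library(purrr)

# Keep y when it is present; otherwise build it from the accumulated previous value
custom_fun <- function(x, y) {
  if (is.na(y)) x + 10 else y
}
data %>%
  group_by(group) %>%
  mutate(value = accumulate(value, custom_fun))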
# Groups: group [2]
group year value
<chr> <int> <dbl>
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
A base solution with ave() + Reduce(accumulate = TRUE):
transform(data, value = ave(value, group, FUN = \(val) {
Reduce(\(x, y) if(is.na(y)) x + 10 else y, val, accumulate = TRUE)
}))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
You could also write a vectorized function:
fun <- function(x){
idx <- is.na(x)
b <- rle(idx)
id2 <- cumsum(b$lengths)[!b$values]
x[idx] <- sequence(b$lengths[b$values], x[id2] + 10, by=10)
x
}
transform(data, value = fun(value))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
I have the following data frame (demo). I would like to find the moving average of the AMOUNT column over the previous 3 months, per ID, YEAR and MONTH.
ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500
Following is an attempt.
library(TTR)
library(dplyr)
DF %>% group_by(ID, YEAR, MONTH) %>% mutate(`3MA` = runMean(AMOUNT, 3))
This fails with an error saying that n = 3 is outside the valid range.
Desired Output:
ID YEAR MONTH AMOUNT 3MA
1 ABC 2020 09 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 01 400 200 (100+200+300)/3
5 ABC 2021 04 500 300 (400+300+200)/3
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 01 400 200 (100+200+300)/3
10 PQR 2021 03 500 300 (400+300+200)/3
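One plausible reading of the error (an assumption, not checked against the TTR source) is that grouping by ID, YEAR and MONTH leaves a single row in every group, so a window of 3 cannot fit. The sketch below only counts rows per group; the data frame is retyped from the table above.

```r
library(dplyr)

DF <- data.frame(
  ID = rep(c("ABC", "PQR"), each = 5),
  YEAR = c(2020, 2020, 2020, 2021, 2021, 2020, 2020, 2020, 2021, 2021),
  MONTH = c(9, 11, 12, 1, 4, 10, 11, 12, 1, 3),
  AMOUNT = c(100, 200, 300, 400, 500, 100, 200, 300, 400, 500)
)

# Rows per group under the grouping used in the attempt
sizes_attempt <- count(DF, ID, YEAR, MONTH)

# Rows per group when grouping by ID only
sizes_by_id <- count(DF, ID)
```

Here sizes_attempt has one row per group, while grouping by ID alone gives five rows per group, enough for a 3-row window.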
You can use the following code:
library(dplyr)
arrange(DF, ID, YEAR, MONTH) %>%
  group_by(ID) %>%
  mutate(lag1 = lag(AMOUNT),
         lag2 = lag(AMOUNT, 2),
         lag3 = lag(AMOUNT, 3),
         movave = (lag1 + lag2 + lag3) / 3)
#> # A tibble: 10 × 8
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT lag1 lag2 lag3 movave
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA NA NA NA
#> 2 ABC 2020 11 200 100 NA NA NA
#> 3 ABC 2020 12 300 200 100 NA NA
#> 4 ABC 2021 1 400 300 200 100 200
#> 5 ABC 2021 4 500 400 300 200 300
#> 6 PQR 2020 10 100 NA NA NA NA
#> 7 PQR 2020 11 200 100 NA NA NA
#> 8 PQR 2020 12 300 200 100 NA NA
#> 9 PQR 2021 1 400 300 200 100 200
#> 10 PQR 2021 3 500 400 300 200 300
Created on 2022-07-02 by the reprex package (v2.0.1)
An option using a sliding window:
library(tidyverse)
library(slider)
df <- tribble(
~id, ~year, ~month, ~amount,
"ABC", 2020, 09, 100,
"ABC", 2020, 11, 200,
"ABC", 2020, 12, 300,
"ABC", 2021, 01, 400,
"ABC", 2021, 04, 500,
"PQR", 2020, 10, 100,
"PQR", 2020, 11, 200,
"PQR", 2020, 12, 300,
"PQR", 2021, 01, 400,
"PQR", 2021, 03, 500
)
df |>
arrange(id, year, month) |>
group_by(id) |>
mutate(ma3 = slide_dbl(lag(amount), mean, .before = 2, .complete = TRUE)) |>
ungroup() # if needed
#> # A tibble: 10 × 5
#> id year month amount ma3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Here is a way.
suppressPackageStartupMessages({
library(dplyr)
library(TTR)
})
x <- ' ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500 '
DF <- read.table(textConnection(x), header = TRUE)
DF %>%
arrange(ID, YEAR, MONTH) %>%
group_by(ID) %>%
mutate(`3MA` = lag(runMean(AMOUNT, 3)))
#> # A tibble: 10 × 5
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT `3MA`
#> <chr> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Try this
DF |> arrange(ID , YEAR , MONTH) |> group_by(ID) |>
mutate(`3M` = (lag(AMOUNT) + lag(AMOUNT ,2) + lag(AMOUNT , 3)) / 3)
output
# A tibble: 10 × 5
# Groups: ID [2]
ID YEAR MONTH AMOUNT `3M`
<chr> <int> <int> <int> <dbl>
1 ABC 2020 9 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 1 400 200
5 ABC 2021 4 500 300
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 1 400 200
10 PQR 2021 3 500 300
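The same idea can be sketched without external packages. prev_ma3 is a hypothetical helper name. Like the answers above, it averages the previous 3 rows within each ID, not the previous 3 calendar months, so gaps between months (such as the missing 2020-10 for ABC) are ignored.

```r
DF <- data.frame(
  ID = rep(c("ABC", "PQR"), each = 5),
  YEAR = c(2020, 2020, 2020, 2021, 2021, 2020, 2020, 2020, 2021, 2021),
  MONTH = c(9, 11, 12, 1, 4, 10, 11, 12, 1, 3),
  AMOUNT = c(100, 200, 300, 400, 500, 100, 200, 300, 400, 500)
)

# Hypothetical helper: mean of the 3 values before each position,
# NA while fewer than 3 previous values exist
prev_ma3 <- function(x) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i > 3) out[i] <- mean(x[(i - 3):(i - 1)])
  }
  out
}

# Sort first so that "previous" means earlier in time within each ID
DF <- DF[order(DF$ID, DF$YEAR, DF$MONTH), ]
DF$MA3 <- ave(DF$AMOUNT, DF$ID, FUN = prev_ma3)
```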
I want to insert the missing weeks for each Household_id and channel combination so that the weeks form a continuous sequence. In the inserted rows, duration should be 0 and the other columns should keep their values.
Below is the dataset.
For example, for Household_id 100 and channel A the missing weeks are 37, 39 and 41. I want these weeks to be inserted with a duration of 0.
For Household_id 101 and channel C, however, two years are involved, 2019 and 2020. The missing weeks are week 52 of 2019 and week 3 of 2020.
Here is what I tried, using the complete function:
library(tidyr)
library(dplyr)
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
For the Household_id 100 and channel A combination it worked fine: all weeks are now in sequence.
But for Household_id 101 and channel C it didn't work. After inserting week 52 of 2019, the sequence should continue with week 1 of 2020.
I also tried deriving dates from the week and year columns, thinking that exact dates might work,
but I was not able to get that to work either.
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
Any help is greatly appreciated!
Here is the sample dataset with code:
library(dplyr)
library(tidyr)
data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","C","C","C","C","D","D"),
duration = c(12,34,567,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,38,40,42,51,1,2,4,38,39),
year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))
# imputing missing weeks and duration = 0 for each household channel combination
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
# Getting Date from week/year if it may help
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
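You can try defining the dates, making the sequence and converting back to weeks. I used lubridate for ease. Note that January 1 plus 7 * week days, combined with lubridate's week(), does not reproduce the week numbering of the input: in the output below the original weeks appear shifted by one (36 becomes 37, and so on), and for household 101 the generated sequence does not line up with the original 2020 rows, which remain as separate rows (16 to 18).
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
  group_by(Household_id, channel) %>%
  mutate(new = paste0(year, '-01-01'),
         new = ymd(new) + 7 * week) %>%
  complete(new = seq(min(new), max(new), by = 'week'), fill = list(duration = 0)) %>%
  mutate(year = replace(year, is.na(year), format(new, '%Y')[is.na(year)]),
         week = week(new)) %>%
  select(-new)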
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 100 A 12 42 2021 37 2021
2 100 A 0 NA NA 38 2021
3 100 A 34 42 2021 39 2021
4 100 A 0 NA NA 40 2021
5 100 A 567 42 2021 41 2021
6 100 A 0 NA NA 42 2021
7 100 A 67 42 2021 43 2021
8 101 C 98 5 2020 52 2019
9 101 C 0 NA NA 53 2019
10 101 C 0 NA NA 1 2020
11 101 C 0 NA NA 2 2020
12 101 C 0 NA NA 3 2020
13 101 C 0 NA NA 4 2020
14 102 D 73 30 2021 39 2021
15 102 D 76 30 2021 40 2021
16 101 C 23 5 2020 2 2020
17 101 C 56 5 2020 3 2020
18 101 C 89 5 2020 5 2020
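An alternative that keeps the original week numbers is to map each (year, week) pair to a running week index, complete that index, and map it back. The sketch below is only that, a sketch: it assumes every year has exactly 52 weeks (a week 53 would not be handled), and week_idx is a hypothetical helper column. Carrying mala_fide_week and mala_fide_year into the inserted rows is also an assumption about what "other columns remain the same" means.

```r
library(dplyr)
library(tidyr)

data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
                   channel = c("A","A","A","A","C","C","C","C","D","D"),
                   duration = c(12,34,567,67,98,23,56,89,73,76),
                   mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
                   mala_fide_year = c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
                   week = c(36,38,40,42,51,1,2,4,38,39),
                   year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))

filled <- data %>%
  # Assumption: 52 weeks per year, so (year, week) maps to one running index
  mutate(week_idx = year * 52 + (week - 1)) %>%
  group_by(Household_id, channel) %>%
  complete(week_idx = seq(min(week_idx), max(week_idx)),
           fill = list(duration = 0)) %>%
  # Recover year and week from the index, including for the inserted rows
  mutate(year = week_idx %/% 52, week = week_idx %% 52 + 1) %>%
  # Re-group defensively, then carry household-level columns into inserted rows
  group_by(Household_id, channel) %>%
  fill(mala_fide_week, mala_fide_year, .direction = "downup") %>%
  ungroup() %>%
  arrange(Household_id, channel, week_idx) %>%
  select(-week_idx)
```

For household 101 this yields weeks 51 and 52 of 2019 followed by weeks 1 to 4 of 2020, with duration 0 in the two inserted weeks.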
I have rows grouped by ID and I want to calculate how much time passes until the next event occurs (if it does occur for that ID).
Here is example code:
year <- c(2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018)
id <- c(rep("A", times = 4), rep("B", times = 4), rep("C", times = 4))
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df<- as.data.frame(cbind(id, year, event_date))
df
id year event_date
1 A 2015 <NA>
2 A 2016 2016
3 A 2017 <NA>
4 A 2018 2018
5 B 2015 <NA>
6 B 2016 <NA>
7 B 2017 <NA>
8 B 2018 <NA>
9 C 2015 2015
10 C 2016 <NA>
11 C 2017 <NA>
12 C 2018 2018
Here is what I want the output to look like:
id year event_date years_till_next_event
1 A 2015 <NA> 1
2 A 2016 2016 0
3 A 2017 <NA> 1
4 A 2018 2018 0
5 B 2015 <NA> <NA>
6 B 2016 <NA> <NA>
7 B 2017 <NA> <NA>
8 B 2018 <NA> <NA>
9 C 2015 2015 0
10 C 2016 <NA> 2
11 C 2017 <NA> 1
12 C 2018 2018 0
Person B does not have the event, so it is not calculated. For the others, I want to calculate the difference between the leading event_date (ignoring NAs, if it exists) and the year.
I want to calculate years_till_next_event such that 1) if there is an event_date for a row, event_date - year. 2) If not, then return the first non-NA leading value - year. I'm having difficulty with the 2nd part of the logic, keeping in mind the event could occur not at all or every year, by ID.
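Using zoo with dplyr (this needs year and event_date to be numeric, e.g. with df <- data.frame(id, year, event_date); the cbind() in the question turns every column into character)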
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(years_till_next_event = na.locf0(event_date, fromLast = TRUE) - year )
Here is a data.table option
library(data.table)
setDT(df)[, years_till_next_event := nafill(event_date, type = "nocb") - year, by = id]
which gives
id year event_date years_till_next_event
1: A 2015 NA 1
2: A 2016 2016 0
3: A 2017 NA 1
4: A 2018 2018 0
5: B 2015 NA NA
6: B 2016 NA NA
7: B 2017 NA NA
8: B 2018 NA NA
9: C 2015 2015 0
10: C 2016 NA 2
11: C 2017 NA 1
12: C 2018 2018 0
You can create a new column to assign a row number within each id if the value is not NA, fill the NA values from the next values and subtract the current row number from it.
library(dplyr)
df %>%
group_by(id) %>%
mutate(years_till_next_event = replace(row_number(),is.na(event_date), NA)) %>%
tidyr::fill(years_till_next_event, .direction = 'up') %>%
mutate(years_till_next_event = years_till_next_event - row_number()) %>%
ungroup
# id year event_date years_till_next_event
# <chr> <dbl> <dbl> <int>
# 1 A 2015 NA 1
# 2 A 2016 2016 0
# 3 A 2017 NA 1
# 4 A 2018 2018 0
# 5 B 2015 NA NA
# 6 B 2016 NA NA
# 7 B 2017 NA NA
# 8 B 2018 NA NA
# 9 C 2015 2015 0
#10 C 2016 NA 2
#11 C 2017 NA 1
#12 C 2018 2018 0
data
df <- data.frame(id, year, event_date)
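For completeness, the "next observation carried backward" step can also be sketched in base R. next_obs is a hypothetical helper name, and the data is rebuilt with data.frame() so that year and event_date are numeric.

```r
year <- rep(2015:2018, times = 3)
id <- rep(c("A", "B", "C"), each = 4)
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df <- data.frame(id, year, event_date)

# Hypothetical helper: walk from the end and copy the following value into each NA
next_obs <- function(x) {
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Fill within each id, then subtract the current year
df$years_till_next_event <- ave(df$event_date, df$id, FUN = next_obs) - df$year
```

IDs without any event, such as B, stay NA throughout because there is nothing to carry backward.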
My data is as follows:
Year Type Amount
2013 A 100
2013 B 150
2013 C 100
2013 D 300
2014 A 200
2014 B 150
2014 C 170
2014 D 100
2014 E 120
2015 A 100
2015 B 350
2015 C 670
2015 D 300
2015 E 220
I'd like to extract only the rows for the earliest and latest year of each type (A, B, C, D, E).
Note that the earliest year for E is 2014 rather than 2013.
The output that I want will look like this:
Year Type Amount
2013 A 100
2013 B 150
2013 C 100
2013 D 300
2014 E 120
2015 A 100
2015 B 350
2015 C 670
2015 D 300
2015 E 220
Is there any way to code this without hardcoding? The data is in a data frame.
Using dplyr, you can group by Type and keep the rows where Year is the minimum or maximum Year for each Type:
library(dplyr)
df %>%
group_by(Type) %>%
filter(Year == min(Year) | Year == max(Year))
Gives us:
Year Type Amount
<int> <chr> <int>
1 2013 A 100
2 2013 B 150
3 2013 C 100
4 2013 D 300
5 2014 E 120
6 2015 A 100
7 2015 B 350
8 2015 C 670
9 2015 D 300
10 2015 E 220
For your follow up, to calculate percent increase:
df %>%
group_by(Type) %>%
filter(Year == min(Year) | Year == max(Year)) %>%
arrange(Type) %>%
mutate(pct_change = (Amount[Year == max(Year)]/Amount[Year == min(Year)] - 1)*100)
Gives us:
Year Type Amount pct_change
<int> <chr> <int> <dbl>
1 2013 A 100 0
2 2015 A 100 0
3 2013 B 150 133.
4 2015 B 350 133.
5 2013 C 100 570
6 2015 C 670 570
7 2013 D 300 0
8 2015 D 300 0
9 2014 E 120 83.3
10 2015 E 220 83.3
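If one row per Type is enough, the same percent change can be sketched with summarise() instead of filter() plus mutate(). The column names first_year and last_year are made up for this illustration, and the data frame is retyped from the question.

```r
library(dplyr)

df <- data.frame(
  Year = c(rep(2013, 4), rep(2014, 5), rep(2015, 5)),
  Type = c("A", "B", "C", "D",
           "A", "B", "C", "D", "E",
           "A", "B", "C", "D", "E"),
  Amount = c(100, 150, 100, 300,
             200, 150, 170, 100, 120,
             100, 350, 670, 300, 220)
)

changes <- df %>%
  group_by(Type) %>%
  summarise(
    first_year = min(Year),
    last_year = max(Year),
    # Amount in the latest year relative to Amount in the earliest year
    pct_change = (Amount[which.max(Year)] / Amount[which.min(Year)] - 1) * 100
  )
```

This assumes each Type has a single row per Year, as in the sample data.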
You can use ave to test, for each Type, whether Year is either the min or the max (the data frame is called x here):
x[ave(x$Year, x$Type, FUN=function(y) y==min(y) | y==max(y))==1,]
# Year Type Amount
#1 2013 A 100
#2 2013 B 150
#3 2013 C 100
#4 2013 D 300
#9 2014 E 120
#10 2015 A 100
#11 2015 B 350
#12 2015 C 670
#13 2015 D 300
#14 2015 E 220
or using range and %in%
x[ave(x$Year, x$Type, FUN=function(y) y %in% range(y))==1,]