Fill a column down, recursively applying a function to the previous value - r

Consider this data
data <- data.frame(
group = c(rep("A", 10), rep("B", 10)),
year = rep(2016:2025, 2),
value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)
data
#> group year value
#> 1 A 2016 10
#> 2 A 2017 20
#> 3 A 2018 30
#> 4 A 2019 40
#> 5 A 2020 NA
#> 6 A 2021 NA
#> 7 A 2022 NA
#> 8 A 2023 NA
#> 9 A 2024 NA
#> 10 A 2025 NA
#> 11 B 2016 70
#> 12 B 2017 80
#> 13 B 2018 NA
#> 14 B 2019 NA
#> 15 B 2020 NA
#> 16 B 2021 NA
#> 17 B 2022 NA
#> 18 B 2023 NA
#> 19 B 2024 NA
#> 20 B 2025 NA
Now I’d like to fill in the missing values applying an arbitrary function
to the last non-missing value, recursively.
For example, let’s say that I want to
let the values increase by 10 points, yearly.
So, for rows where value is not NA, it should remain unmodified.
Starting from the first row where value is NA, it should apply value_t = value_{t-1} + 10.
A naive attempt would be to use dplyr::lag,
but this only works for the first missing value, because lag is vectorized:
it operates on the original value vector and does not recurse over the newly computed values.
library(dplyr)
data |>
group_by(group) |>
mutate(value_fix = dplyr::lag(value) + 10)
#> # A tibble: 20 × 4
#> # Groups: group [2]
#> group year value value_fix
#> <chr> <int> <dbl> <dbl>
#> 1 A 2016 10 NA
#> 2 A 2017 20 20
#> 3 A 2018 30 30
#> 4 A 2019 40 40
#> 5 A 2020 NA 50
#> 6 A 2021 NA NA
#> 7 A 2022 NA NA
#> 8 A 2023 NA NA
#> 9 A 2024 NA NA
#> 10 A 2025 NA NA
#> 11 B 2016 70 NA
#> 12 B 2017 80 80
#> 13 B 2018 NA 90
#> 14 B 2019 NA NA
#> 15 B 2020 NA NA
#> 16 B 2021 NA NA
#> 17 B 2022 NA NA
#> 18 B 2023 NA NA
#> 19 B 2024 NA NA
#> 20 B 2025 NA NA
This is where I thought tidyr could help, because it is somewhat similar
to tidyr::fill
data |>
group_by(group) |>
tidyr::fill(value)
but ideally, with an .f argument to be applied recursively to the last value.
There does not seem to be something like that.
Googling around I came up with this solution
data |>
group_by(group) |>
mutate(last_value = case_when(
value == dplyr::last(na.omit(value)) ~ value,
TRUE ~ NA_real_
)) |>
mutate(value_fix = purrr::accumulate(
.x = last_value,
.f = ~ coalesce(.x + 10, .y)
))
#> # A tibble: 20 × 5
#> # Groups: group [2]
#> group year value last_value value_fix
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 2016 10 NA NA
#> 2 A 2017 20 NA NA
#> 3 A 2018 30 NA NA
#> 4 A 2019 40 40 40
#> 5 A 2020 NA NA 50
#> 6 A 2021 NA NA 60
#> 7 A 2022 NA NA 70
#> 8 A 2023 NA NA 80
#> 9 A 2024 NA NA 90
#> 10 A 2025 NA NA 100
#> 11 B 2016 70 NA NA
#> 12 B 2017 80 80 80
#> 13 B 2018 NA NA 90
#> 14 B 2019 NA NA 100
#> 15 B 2020 NA NA 110
#> 16 B 2021 NA NA 120
#> 17 B 2022 NA NA 130
#> 18 B 2023 NA NA 140
#> 19 B 2024 NA NA 150
#> 20 B 2025 NA NA 160
This works, but it seems kind of hacky and is not easy to read. It would be cleaner to just write the loop and be happy.
I really thought that for such a simple case there would be a built-in way (vectorized, readable code) to do it in the tidyverse, but I could not find one. Am I missing something? Any ideas on how to do this better?
Created on 2022-08-30 with reprex v2.0.2

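library(dplyr)
library(purrr)
# Keep observed values; where the current value is NA, add 10 to the previous result
custom_fun <- function(x, y) {
if(is.na(y)) x + 10 else y
}
data %>%
group_by(group) %>%
mutate(value = accumulate(value, custom_fun))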
# Groups: group [2]
group year value
<chr> <int> <dbl>
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160

A base solution with ave() + Reduce(accumulate = TRUE):
transform(data, value = ave(value, group, FUN = \(val) {
Reduce(\(x, y) if(is.na(y)) x + 10 else y, val, accumulate = TRUE)
}))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
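Both answers above hard-code the + 10 step. As a minimal sketch, the step can instead be passed in as a function, which is roughly the .f argument the question asks for. The helper name fill_apply and its f argument are hypothetical; they are not part of base R, dplyr or tidyr.

```r
# Hypothetical helper: carry the last value forward, applying f() to it
# each time a missing value is encountered. Observed values are kept as-is.
fill_apply <- function(x, f) {
  Reduce(function(prev, cur) if (is.na(cur)) f(prev) else cur,
         x, accumulate = TRUE)
}

data <- data.frame(
  group = c(rep("A", 10), rep("B", 10)),
  year = rep(2016:2025, 2),
  value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
            70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)

# Apply the helper within each group; here f adds 10 per year
data$value_fix <- ave(data$value, data$group,
                      FUN = function(v) fill_apply(v, function(p) p + 10))
data
```

Swapping in another step, for example function(p) p * 1.05, would give 5% yearly growth instead, without touching the helper.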

You could also write a vectorized function:
fun <- function(x){
idx <- is.na(x)
b <- rle(idx)
id2 <- cumsum(b$lengths)[!b$values]
x[idx] <- sequence(b$lengths[b$values], x[id2] + 10, by=10)
x
}
transform(data, value = fun(value))
group year value
1 A 2016 10
2 A 2017 20
3 A 2018 30
4 A 2019 40
5 A 2020 50
6 A 2021 60
7 A 2022 70
8 A 2023 80
9 A 2024 90
10 A 2025 100
11 B 2016 70
12 B 2017 80
13 B 2018 90
14 B 2019 100
15 B 2020 110
16 B 2021 120
17 B 2022 130
18 B 2023 140
19 B 2024 150
20 B 2025 160
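Note that transform(data, value = fun(value)) runs fun() over the whole column and ignores group. It gives the right result here because every group starts with an observed value. A hedged sketch of a grouped variant follows, reusing the same fun() through ave(). It assumes each group begins with a non-missing value and ends with a run of NAs, as in the example data; other shapes are not handled by this sketch. It also needs a version of R in which sequence() accepts from and by.

```r
fun <- function(x) {
  idx <- is.na(x)
  b <- rle(idx)
  # end position of each run of observed values, i.e. the last
  # observed value before each run of NAs (under the assumption above)
  id2 <- cumsum(b$lengths)[!b$values]
  x[idx] <- sequence(b$lengths[b$values], x[id2] + 10, by = 10)
  x
}

data <- data.frame(
  group = c(rep("A", 10), rep("B", 10)),
  year = rep(2016:2025, 2),
  value = c(10, 20, 30, 40, NA, NA, NA, NA, NA, NA,
            70, 80, NA, NA, NA, NA, NA, NA, NA, NA)
)

# ave() calls fun() separately on the values of each group
data$value_fix <- ave(data$value, data$group, FUN = fun)
data
```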

Related

Moving average by multiple group

I have the following DF (demo). I would like to find the moving average of the AMOUNT column over the previous 3 months, per ID, YEAR and MONTH.
ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500
The following is my attempt.
library(TTR)
library(dplyr)
DF %>% group_by(ID, YEAR, MONTH) %>% mutate(3MA = runMean(AMOUNT, 3))
This results in an error saying that n = 3 is outside the valid range.
Desired Output:
ID YEAR MONTH AMOUNT 3MA
1 ABC 2020 09 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 01 400 200 (100+200+300)/3
5 ABC 2021 04 500 300 (400+300+200)/3
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 01 400 200 (100+200+300)/3
10 PQR 2021 03 500 300 (400+300+200)/3
You can use the following code:
library(dplyr)
arrange(DF, ID, YEAR, MONTH) %>%
group_by(ID) %>%
mutate(lag1=lag(AMOUNT),
lag2=lag(AMOUNT,2),
lag3=lag(AMOUNT,3),
movave=(lag1+lag2+lag3)/3)
#> # A tibble: 10 × 8
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT lag1 lag2 lag3 movave
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA NA NA NA
#> 2 ABC 2020 11 200 100 NA NA NA
#> 3 ABC 2020 12 300 200 100 NA NA
#> 4 ABC 2021 1 400 300 200 100 200
#> 5 ABC 2021 4 500 400 300 200 300
#> 6 PQR 2020 10 100 NA NA NA NA
#> 7 PQR 2020 11 200 100 NA NA NA
#> 8 PQR 2020 12 300 200 100 NA NA
#> 9 PQR 2021 1 400 300 200 100 200
#> 10 PQR 2021 3 500 400 300 200 300
Created on 2022-07-02 by the reprex package (v2.0.1)
An option using a sliding window:
library(tidyverse)
library(slider)
df <- tribble(
~id, ~year, ~month, ~amount,
"ABC", 2020, 09, 100,
"ABC", 2020, 11, 200,
"ABC", 2020, 12, 300,
"ABC", 2021, 01, 400,
"ABC", 2021, 04, 500,
"PQR", 2020, 10, 100,
"PQR", 2020, 11, 200,
"PQR", 2020, 12, 300,
"PQR", 2021, 01, 400,
"PQR", 2021, 03, 500
)
df |>
arrange(id, year, month) |>
group_by(id) |>
mutate(ma3 = slide_dbl(lag(amount), mean, .before = 2, .complete = TRUE)) |>
ungroup() # if needed
#> # A tibble: 10 × 5
#> id year month amount ma3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Here is a way.
suppressPackageStartupMessages({
library(dplyr)
library(TTR)
})
x <- ' ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500 '
DF <- read.table(textConnection(x), header = TRUE)
DF %>%
arrange(ID, YEAR, MONTH) %>%
group_by(ID) %>%
mutate(`3MA` = lag(runMean(AMOUNT, 3)))
#> # A tibble: 10 × 5
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT `3MA`
#> <chr> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Try this
DF |> arrange(ID , YEAR , MONTH) |> group_by(ID) |>
mutate(`3M` = (lag(AMOUNT) + lag(AMOUNT ,2) + lag(AMOUNT , 3)) / 3)
output
# A tibble: 10 × 5
# Groups: ID [2]
ID YEAR MONTH AMOUNT `3M`
<chr> <int> <int> <int> <dbl>
1 ABC 2020 9 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 1 400 200
5 ABC 2021 4 500 300
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 1 400 200
10 PQR 2021 3 500 300
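All of the answers above average the previous three rows per ID, not the previous three calendar months (the months in the data have gaps). Under that same row-based reading, here is a base R sketch that needs no extra packages. prev_mean is a hypothetical helper name, and the data frame is re-typed from the question.

```r
# Hypothetical helper: mean of the n values preceding each position,
# NA where fewer than n earlier values exist.
prev_mean <- function(x, n = 3) {
  cs <- c(0, cumsum(x))            # cs[i + 1] is sum(x[1:i])
  out <- rep(NA_real_, length(x))
  i <- seq_along(x)
  ok <- i > n
  # sum of x[(i - n):(i - 1)] is cs[i] - cs[i - n]
  out[ok] <- (cs[i[ok]] - cs[i[ok] - n]) / n
  out
}

DF <- data.frame(
  ID = rep(c("ABC", "PQR"), each = 5),
  YEAR = c(2020, 2020, 2020, 2021, 2021, 2020, 2020, 2020, 2021, 2021),
  MONTH = c(9, 11, 12, 1, 4, 10, 11, 12, 1, 3),
  AMOUNT = c(100, 200, 300, 400, 500, 100, 200, 300, 400, 500)
)

# Sort first so that "previous" means earlier in time within each ID
DF <- DF[order(DF$ID, DF$YEAR, DF$MONTH), ]
DF$MA3 <- ave(DF$AMOUNT, DF$ID, FUN = prev_mean)
DF
```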

How to add a new column with values specific to grouped variables

I'm new to R and have found similar solutions to my problem, but I'm struggling to apply these to my code. Please help...
These data are simplified, as the id variables are many:
df = data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
site = rep(1:5, 4),
value = sample(1:20))
The aim is to add another column labelled "year" with values that are grouped by "id" but the true names are many - so I'm trying to simplify the code by using the ending digits.
I can use dplyr to split the dataframe into each id variable using this code (repeated for each id variable):
df %>%
select(site, id, value) %>%
filter(grepl("10$", id)) %>%
mutate(Year = "2010")
Rather than using merge to re-combine the dataframes back into one, is there a simpler method?
I tried modifying case_when with mutate as described in a previous answer:
https://stackoverflow.com/a/63043920/12313457
mutate(year = case_when(grepl(c("10$", "11$", id) == c("2010", "2011"))))
Is something like this possible?
Thanks in advance
In case your id column has different string lengths you can use sub:
df %>%
mutate(Year = paste0("20", sub('^.*_(\\d+)$', '\\1', id)))
#> id site value Year
#> 1 a_10 1 2 2010
#> 2 a_10 2 7 2010
#> 3 a_10 3 16 2010
#> 4 a_10 4 10 2010
#> 5 a_10 5 11 2010
#> 6 a_11 1 5 2011
#> 7 a_11 2 13 2011
#> 8 a_11 3 14 2011
#> 9 a_11 4 6 2011
#> 10 a_11 5 12 2011
#> 11 b_10 1 17 2010
#> 12 b_10 2 1 2010
#> 13 b_10 3 4 2010
#> 14 b_10 4 15 2010
#> 15 b_10 5 9 2010
#> 16 b_11 1 8 2011
#> 17 b_11 2 20 2011
#> 18 b_11 3 19 2011
#> 19 b_11 4 18 2011
#> 20 b_11 5 3 2011
Created on 2022-04-21 by the reprex package (v2.0.1)
You can use substr to get the final two digits of id and then paste0 this to "20" to recreate the year.
df |> dplyr::mutate(Year = paste0("20", substr(id, 3, 4)))
#> id site value Year
#> 1 a_10 1 5 2010
#> 2 a_10 2 12 2010
#> 3 a_10 3 9 2010
#> 4 a_10 4 7 2010
#> 5 a_10 5 13 2010
#> 6 a_11 1 3 2011
#> 7 a_11 2 4 2011
#> 8 a_11 3 16 2011
#> 9 a_11 4 2 2011
#> 10 a_11 5 6 2011
#> 11 b_10 1 19 2010
#> 12 b_10 2 14 2010
#> 13 b_10 3 15 2010
#> 14 b_10 4 10 2010
#> 15 b_10 5 11 2010
#> 16 b_11 1 18 2011
#> 17 b_11 2 1 2011
#> 18 b_11 3 20 2011
#> 19 b_11 4 17 2011
#> 20 b_11 5 8 2011
Created on 2022-04-21 by the reprex package (v2.0.1)
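The case_when() idea from the question can also be made to work. A minimal sketch, assuming only the two suffixes 10 and 11 occur: each condition is written as its own two-sided formula instead of passing vectors to grepl(). The set.seed() call is my addition, only there to make the sampled values reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: only so that sample() is reproducible
df <- data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
                 site = rep(1:5, 4),
                 value = sample(1:20))

out <- df %>%
  mutate(year = case_when(
    grepl("10$", id) ~ "2010",
    grepl("11$", id) ~ "2011"
    # ids matching neither pattern are left as NA
  ))
out
```

With many different suffixes, the sub() approach above scales better, since case_when() needs one line per year.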

R: Pivot_Wider/spread by obtaining average sorted by year

I've the following dataset
Pet Shop  Year  Item    Price
A         2021  dog     300
A         2021  dog     250
A         2021  fish    20
A         2020  turtle  50
A         2020  dog     250
A         2020  cat     280
A         2019  rabbit  180
A         2019  cat     165
A         2019  cat     270
B         2021  dog     350
B         2021  fish    80
B         2021  fish    70
B         2020  cat     220
B         2020  turtle  90
B         2020  turtle  80
B         2020  fish    55
B         2019  fish    75
C         2021  dog     280
C         2020  cat     260
C         2020  cat     270
C         2019  fish    65
C         2019  cat     270
The code for the data is as follows
Pet_Shop = c(rep("A",9), rep("B",8), rep("C",5))
Year = c(rep(c(2021,2020,2019), each = 3), rep(2021,3), rep(2020,4), 2019, 2021, 2020, 2020, 2019, 2019)
Item = c("Dog","Dog","Fish","Turtle","Dog","Cat","Rabbit","Cat","Cat","Dog","Fish","Fish","Cat","Turtle","Turtle","Fish","Fish","Dog","Cat","Cat","Fish","Cat")
Price = c(300,250,20,50,250,280,180,165,270,350,80,70,220,90,80,55,75,280,260,270,65,270)
Data = data.frame(Pet_Shop, Year, Item, Price)
Does anyone here know how I can use pivot_wider or spread (or any other method) to achieve the following table? It groups each shop by year and takes the average price of identical items within the same shop and year. I'm having trouble incorporating the year.
Pet Shop  Year  dog                     fish  turtle  cat    rabbit
A         2021  Average(300,250) = 275  20    NA      NA     NA
A         2020  250                     NA    50      280    NA
A         2019  NA                      NA    NA      217.5  NA
B         2021  350                     75    NA      NA     NA
B         2020  NA                      55    85      220    NA
B         2019  NA                      75    NA      NA     NA
C         2021  280                     NA    NA      NA     NA
C         2020  NA                      NA    NA      265    NA
C         2019  NA                      60    NA      270    NA
In pivot_wider you may pass a function (values_fn) to be applied to each combination of Pet_Shop and Year.
result <- tidyr::pivot_wider(Data, names_from = Item,
values_from = Price, values_fn = mean)
result
# Pet_Shop Year dog fish turtle cat rabbit
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2021 275 20 NA NA NA
#2 A 2020 250 NA 50 280 NA
#3 A 2019 NA NA NA 218. 180
#4 B 2021 350 75 NA NA NA
#5 B 2020 NA 55 85 220 NA
#6 B 2019 NA 75 NA NA NA
#7 C 2021 280 NA NA NA NA
#8 C 2020 NA NA NA 265 NA
#9 C 2019 NA 65 NA 270 NA
The same can also be done with data.table dcast -
library(data.table)
dcast(setDT(Data), Pet_Shop + Year ~ Item,
value.var = "Price", fun.aggregate = mean)
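Both calls rely on Data having a Year column. A self-contained sketch follows, with the data re-typed from the table in the question. The lower-case item names are my assumption, chosen to match the printed result, and passing a single function to values_fn assumes a reasonably recent tidyr.

```r
library(tidyr)

# Data re-typed from the question's table (22 rows)
Data <- data.frame(
  Pet_Shop = c(rep("A", 9), rep("B", 8), rep("C", 5)),
  Year = c(rep(c(2021, 2020, 2019), each = 3),
           rep(2021, 3), rep(2020, 4), 2019,
           2021, 2020, 2020, 2019, 2019),
  Item = c("dog", "dog", "fish", "turtle", "dog", "cat", "rabbit", "cat", "cat",
           "dog", "fish", "fish", "cat", "turtle", "turtle", "fish", "fish",
           "dog", "cat", "cat", "fish", "cat"),
  Price = c(300, 250, 20, 50, 250, 280, 180, 165, 270,
            350, 80, 70, 220, 90, 80, 55, 75,
            280, 260, 270, 65, 270)
)

# Pet_Shop and Year are not named in names_from/values_from, so they
# identify the rows; duplicate shop/year/item prices are averaged.
result <- pivot_wider(Data, names_from = Item, values_from = Price,
                      values_fn = mean)
result
```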

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

Consider a dataframe in R where I want to drop row 6 because it has missing observations for the variables var1:var3. But the dataframe has valid observations for id and year. See code below.
In python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How to do this in R with tidyverse?
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
We may use complete.cases
library(dplyr)
df %>%
filter(if_any(var1:var3, complete.cases))
-output
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
We can use pmap for this case also:
library(dplyr)
library(purrr)
df %>%
filter(!pmap_lgl(., ~ {x <- c(...)[-c(1, 2)];
all(is.na(x))}))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could also use the complete.cases function in pmap, as suggested by @akrun:
df %>%
filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter -
library(dplyr)
df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows which have at least one non-NA value.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
If looking for complete cases, use the following (kernel of this is based on other answers):
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
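The answers above implement two different rules, so it may help to see them side by side. pandas' dropna(subset=...) drops a row if any of the listed columns is missing (its default is how='any'), whereas the question asks to drop only rows where all of them are missing. A base R sketch on a small made-up data frame; the values are illustrative and not taken from the question.

```r
# Illustrative data: row 2 has all three vars missing,
# rows 3 and 4 have some of them missing.
df <- data.frame(
  id   = 1:5,
  year = 2001:2005,
  var1 = c(10, NA, 30, NA, 50),
  var2 = c(1, NA, NA, 4, 5),
  var3 = c(7, NA, 9, NA, 11)
)
cols <- c("var1", "var2", "var3")

# Drop rows where ALL of var1:var3 are missing (removes row 2 only)
keep_some <- df[rowSums(!is.na(df[cols])) > 0, ]

# Drop rows where ANY of var1:var3 is missing (keeps rows 1 and 5)
keep_complete <- df[complete.cases(df[cols]), ]

keep_some
keep_complete
```

In the tidyverse, tidyr::drop_na(df, var1:var3) should correspond to the second, "any" rule.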

how to replace missing values with previous year's binned mean

I have a data frame as shown below.
Bin_p1 and Bin_f1 were calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean over the previous two years within a bin and use it to replace the NA values in that bin?
For example, in row 5 p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of the previous two years for the same bin (7), i.e. df should become df1:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
