Conditional cumulative sum from two columns - r

I can't get my head around the following problem.
Assume the following data:
library(tidyverse)
df <- tibble(source = c("A", "A", "B", "B", "B", "C"),
value = c(5, 10, NA, NA, NA, 20),
add = c(1, 1, 1, 2, 3, 4))
What I want to do is: for all cases where source == "B", I want to calculate the cumulative sum of the previous row's value and the current row's add. Of course, for the first "B" row, I need to provide a starting value for value. Note: in this case, it would be fine if we just take the value from the last "A" row.
So for row 3, the result would be 10 + 1 = 11.
For row 4, the result would be 11 + 2 = 13.
For row 5, the result would be 13 + 3 = 16.
I tried to use purrr::accumulate, but I failed in many different ways. For example, I thought I could do:
df %>%
mutate(test = accumulate(add, .init = 10, ~.x + .y))
But this leads to error:
Error: Problem with `mutate()` column `test`.
i `test = accumulate(add, .init = 10, ~.x + .y)`.
i `test` must be size 6 or 1, not 7.
The same happens if I use .init = value.
I also didn't manage to do the job only on group B (although this is probably not an issue: I think I can perform the calculation on the full data frame and then just replace the values for all non-B rows).
Expected output:
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA

You were essentially heading in the right direction. Since you provide an .init value to accumulate, the resulting vector is of size n+1, with the first value being .init. You have to remove the first value to get a vector that fits your column size.
Then, if you want NAs in the remaining rows, here's a way to do it. Also, since accumulation starts at row 1 but the first "B" row is the third, .init has to be set to 8: the two "A" rows add 1 + 1 = 2 before row 3 is reached, and 10 - 2 = 8.
df %>%
mutate(test =
ifelse(source == "B", accumulate(add, .init = 8, ~.x + .y)[-1], NA))
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
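To see why the original attempt failed, here is a small sketch (assuming purrr is installed) of how .init changes the length of the result. The add vector is copied from the question; with_init and per_row are only illustrative names.

```r
library(purrr)

add <- c(1, 1, 1, 2, 3, 4)

# With .init, accumulate() returns the initial value plus one result per
# element, so the output has length(add) + 1 entries.
with_init <- accumulate(add, `+`, .init = 8)
length(with_init)   # 7, one more than the 6 rows of df

# Dropping the first element (the .init itself) leaves one value per row.
per_row <- with_init[-1]
per_row             # 9 10 11 13 16 20
```

The third to fifth entries of per_row (11, 13, 16) are the values that end up in the "B" rows.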

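@tmfmnk provided an awesome answer and deserves full credit (not me).
Below is the same code from their comment (posted for more visibility, and also setting an initial value). Note that the first mutate() is overwritten by the second one, and that test is filled for every source group, not only "B".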
init_value = 10
df = df %>%
mutate(test = lag(value)) %>%
group_by(source) %>%
mutate(test = init_value + cumsum(add))
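If you would rather not hard-code the starting value, a minimal base R sketch is shown below. It assumes, as the question allows, that the start is the value in the row just before the first "B" row, and that the "B" rows form one contiguous block. The names is_b and start are illustrative.

```r
df <- data.frame(
  source = c("A", "A", "B", "B", "B", "C"),
  value  = c(5, 10, NA, NA, NA, 20),
  add    = c(1, 1, 1, 2, 3, 4)
)

is_b <- df$source == "B"

# Assumption: the starting value is the value in the row just before
# the first "B" row (here, the last "A" row).
start <- df$value[which(is_b)[1] - 1]

# Non-B rows stay NA; B rows get the start plus the running sum of add.
df$test <- NA_real_
df$test[is_b] <- start + cumsum(df$add[is_b])
df$test   # NA NA 11 13 16 NA
```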

Related

Proportional subtraction from multiple variables

I have two variables that change over time. The additions are known, but for subtraction I only know the total that needs to be subtracted, and I want to subtract it from both variables proportionally.
Here is an example of a dataset
library(dplyr)
df = data.frame(id = c(rep(1,5), rep(2,3)),
ord = c(1:5, 1:3),
a = c(10, NA, 20, 0, NA, 0, 15, NA),
b = c(0, NA, 0, 15, NA, 10, 0, NA),
substract = c(NA, -5, NA, NA, -10, NA, NA, -15)) %>%
rowwise() %>%
mutate(all = sum(c(a, b, substract), na.rm = TRUE)) %>%
arrange(id, ord) %>%
group_by(id) %>%
mutate(all = cumsum(all)) %>%
ungroup()
So, I want to replace each NA in a and b with the value from substract multiplied by the cumulative sum of a or b, respectively, and divided by the value of all just before the NA. The problem is that each replacement should take all previous replacements into account, because the cumulative sums of a and b change after each one.
I have a solution with a while loop that works, but it is highly inefficient. The original dataset is huge, so it is not an option for me, but it might give some additional insight into what I would like to achieve.
test = df %>%
group_by(id)
while(any(is.na(test$a))){
test = test %>%
mutate(across(c("a", "b"), ~ ifelse(is.na(.x), lag(cumsum(.x)) / lag(all) * substract, .x)))
}
Could anyone suggest a more efficient solution? For example, is there a way to make mutate() save its changes after each replacement, so that it does not need to be put into a while loop?
EDIT: user63230 suggested using recursive functions. This does seem to be what I want, but I still have difficulty applying them to my case. accumulate2() accepts only a three-argument function and does not seem to work with lag() (I need more than just the previous value of the current variable), so it does not seem to be enough. Maybe there is a way to make it work, but I have not discovered it yet. Any help would be much appreciated.
Using a similar approach as here, I think this would work, although not pretty:
library(dplyr)
sp <- split(df, df$id)
list_of_dfs <- lapply(sp, function(x){
for(i in which(is.na(x$a))){
tmp <- x[seq_len(i), ]
x$a[i] <- tail(cumsum(tmp$a)[!is.na(cumsum(tmp$a))], 1)/tail(dplyr::lag(tmp$all), 1)*tail((tmp$substract), 1)
}
x
})
bind_rows(list_of_dfs)
# id ord a b substract all
# <dbl> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 10 0 NA 10
# 2 1 2 -5 NA -5 5
# 3 1 3 20 0 NA 25
# 4 1 4 0 15 NA 40
# 5 1 5 -6.25 NA -10 30
# 6 2 1 0 10 NA 10
# 7 2 2 15 0 NA 25
# 8 2 3 -9 NA -15 10
The same loop can be repeated (or automated) for b if needed.
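As a possible way to handle a and b together in a single pass, here is a base R sketch. It is not the answerer's code: fill_proportional is a hypothetical helper, and it assumes that a and b are NA in the same rows, that the first row of each group is not NA, and that the running total of a plus b equals the lagged all column (which holds for the example data).

```r
df <- data.frame(id = c(rep(1, 5), rep(2, 3)),
                 ord = c(1:5, 1:3),
                 a = c(10, NA, 20, 0, NA, 0, 15, NA),
                 b = c(0, NA, 0, 15, NA, 10, 0, NA),
                 substract = c(NA, -5, NA, NA, -10, NA, NA, -15))

# Hypothetical helper: walk through one group once, keeping running totals.
fill_proportional <- function(a, b, substract) {
  run_a <- 0
  run_b <- 0
  for (i in seq_along(a)) {
    if (is.na(a[i])) {
      total <- run_a + run_b           # assumed equal to the lagged `all`
      a[i] <- substract[i] * run_a / total
      b[i] <- substract[i] * run_b / total
    }
    run_a <- run_a + a[i]              # totals already include replacements
    run_b <- run_b + b[i]
  }
  list(a = a, b = b)
}

# Apply the helper to each id separately and stack the results.
out <- do.call(rbind, lapply(split(df, df$id), function(g) {
  filled <- fill_proportional(g$a, g$b, g$substract)
  g$a <- filled$a
  g$b <- filled$b
  g
}))
out$a   # 10 -5 20 0 -6.25 0 15 -9
out$b   # 0 0 0 15 -3.75 10 0 -6
```

Because the running totals are updated after every row, each replacement sees all earlier replacements without re-running a whole mutate() per NA.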

Setting missing values using labelled package across multiple columns?

I am using the labelled package and trying to set user-defined missing values. I have a dataframe where I want to set missing values for a list of specific columns rather than the entire dataset.
Currently I have to type out each column (s2 and s3). Is there a more efficient way? My full dataset has dozens of columns.
df <- tibble(s1 = c(1, 2, 3, 9), s2 = c(1, 1, 2, 9), s3 = c(1, 1, 2, 9))
df <- df %>%
set_na_values(., s2 = 9) %>%
set_na_values(., s3 = 9)
na_values(df$s1)
na_values(df$s2)
na_values(df$s3)
The set_na_values() function takes multiple pairs so you don't need to call it more than once:
library(labelled)
library(dplyr)
df %>%
set_na_values(s2 = 9, s3 = 9)
If you were dealing with a lot of variables, you could programmatically build a named vector or list (a list if there are multiple missing values per variable) and splice it inside the function. If, as in your comment, you want to apply it to everything except the s1 variable, you can do:
nm <- setdiff(names(df), "s1")
df %>%
set_na_values(!!!setNames(rep(9, length(nm)), nm))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
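For illustration, this is what the spliced object looks like in plain base R, both as a named vector (one missing value per variable) and as a named list (several per variable). The extra code 8 and the names na_single and na_multi are hypothetical, used only to show the shape of the list variant.

```r
df <- data.frame(s1 = c(1, 2, 3, 9), s2 = c(1, 1, 2, 9), s3 = c(1, 1, 2, 9))
nm <- setdiff(names(df), "s1")

# One missing value per variable: a named numeric vector.
na_single <- setNames(rep(9, length(nm)), nm)
na_single

# Several missing values per variable: a named list.
# (8 is a hypothetical second missing-value code.)
na_multi <- setNames(rep(list(c(8, 9)), length(nm)), nm)
str(na_multi)
```

Either object has one element per target column, named after that column, which is the shape the name = value pairs of set_na_values() take once spliced.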
Alternatively, you can use labelled_spss() and take advantage of across() which allows tidyselect semantics (but this will overwrite any existing labelled values):
df %>%
mutate(across(-s1, labelled_spss, na_values = 9))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
To keep any existing value labels, use:
df %>%
mutate(across(-s1, ~ labelled_spss(.x, labels = val_labels(.x), na_values = 9)))

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
a = rep(1:2, each = 5),
b = 1:10,
step_window = c(2,3,1,2,4, 1,2,3,2,1)
)
I'm trying to sum up the values of b within the groups of a. The trick is that, for each row, I want the sum of b over a window of step_window rows, starting at the current row (windows that run past the end of the group are truncated).
This is the output I'm looking for:
data.frame(
a = rep(1:2, each = 5),
step_window = c(2,3,1,2,4,
1,2,3,2,1),
b = 1:10,
sum_b_step_window = c(3, 9, 3, 9, 5,
6, 15, 27, 19, 10)
)
I tried to do this using RcppRoll, but I get the error "Expecting a single value":
df %>%
group_by(a) %>%
mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
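The indexing trick in that answer relies on out-of-range positions returning NA, which na.rm = TRUE then drops. A base R sketch for the first group only (values copied from the question; window_sum is an illustrative name):

```r
b <- 1:5
step_window <- c(2, 3, 1, 2, 4)

# Indexing past the end of a vector yields NA rather than an error.
b[5:8]                      # 5 NA NA NA
sum(b[5:8], na.rm = TRUE)   # 5

# The same idea applied to every row of the group.
window_sum <- vapply(
  seq_along(b),
  function(i) sum(b[i:(i + step_window[i] - 1)], na.rm = TRUE),
  numeric(1)
)
window_sum                  # 3 9 3 9 5
```

One caveat of this approach is that na.rm = TRUE would also silently drop genuine NAs in b.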
1) rollapply
rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end then use just the values within the data. (Another possibility would be to use fill=NA instead, in which case it would fill with NAs if there were not enough data left.) align="left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
group_by(a) %>%
mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
ungroup()
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>%
ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a tidyverse-style package (only a couple of months old at the time of writing) dedicated to sliding-window functions. See its package page and vignette for more info.
hop() is the engine of slider. Here we pass a different .starts and .stops index for each row, so that b is summed over a row-specific window within each group of a.
With _vec you're asking hop to return a vector: an integer vector in this case, as the output above shows.
row_number() is a dplyr function that returns the row number within each group, thus allowing you to slide along the rows.
data.table solution using cumulative sums
library(data.table)
setDT(df)
df[, sum_b_step_window := {
cs <- c(0,cumsum(b))
cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]
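The cumulative-sum trick works because a window sum is the difference of two prefix sums. A base R sketch for the second group only (values copied from the question; last and window_sum are illustrative names):

```r
b <- 6:10
step_window <- c(1, 2, 3, 2, 1)
n <- length(b)

# cs[k + 1] holds the sum of the first k values of b (cs[1] is 0).
cs <- c(0, cumsum(b))

# Last row of each window, capped at the end of the group.
last <- pmin(n, seq_len(n) + step_window - 1)

# Sum over rows i..last[i] = prefix sum up to last[i] minus prefix sum up to i - 1.
window_sum <- cs[last + 1] - cs[seq_len(n)]
window_sum   # 6 15 27 19 10
```

This needs only one cumsum() per group, so the cost does not grow with the window size.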

Ignore value conditionally within group_by in dplyr

Please consider the following.
Background
In a data.frame I have patient IDs (id), the day at which patients are admitted to a hospital (day), a code for the diagnostic activity they received that day (code), a price for that activity (price) and a frequency for that activity (freq).
Activities with code b and c are registered at the same time but mean more or less the same thing and should not be double counted.
Problem
What I want is: if code "b" and "c" are registered for the same day, code "b" should be ignored.
The example data.frame looks like this:
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
day = c(1, 1, 1, 2, 1, 2, 3),
price = c(500, 10, 100, rep(10, 3), 100),
code = c("a", "b", "c", rep("b", 3), "c"),
freq = c(rep(1, 5), rep(2, 2)))
> x
id day price code freq
1 a 1 500 a 1
2 a 1 10 b 1
3 a 1 100 c 1
4 a 2 10 b 1
5 b 1 10 b 1
6 b 2 10 b 2
7 b 3 100 c 2
So the costs for patient "a" for day 1 would be 600 and not 610 as I can compute with the following:
x %>%
group_by(id, day) %>%
summarise(res = sum(price * freq))
# A tibble: 5 x 3
# Groups: id [?]
id day res
<fct> <dbl> <dbl>
1 a 1. 610.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
Possible approaches
Either I delete observation code "b" when "c" is present on that same day or I set freq of code "b" to 0 in case code "c" is present on the same day.
All my attempts with ifelse and mutate failed so far.
Any help is much appreciated. Thank you very much in advance!
You can add a filter line to remove the offending b values like this...
x %>%
group_by(id, day) %>%
filter(!(code=="b" & "c" %in% code)) %>%
summarise(res = sum(price * freq))
id day res
<fct> <dbl> <dbl>
1 a 1. 600.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
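The other approach mentioned in the question, setting freq to 0 for code "b" whenever "c" occurs on the same day, can be sketched with mutate() and ifelse() as follows. This is one possible formulation, not the answerer's code, and it assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))

res <- x %>%
  group_by(id, day) %>%
  # any(code == "c") is evaluated per (id, day) group, so "b" rows are
  # zeroed only when a "c" row exists for the same patient and day.
  mutate(freq = ifelse(code == "b" & any(code == "c"), 0, freq)) %>%
  summarise(res = sum(price * freq), .groups = "drop")
res
```

Unlike filter(), this keeps every row until the final summarise(), which can help if the zeroed rows are needed for other checks.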
You could create a new column like this:
mutate(code_day = paste0(ifelse(code %in% c("b", "c"), "z", as.character(code)), day))
Then all your Bs and Cs will become Zs (without losing the original code column that helps you tell them apart). You can then arrange by code descending and remove duplicate values of code_day within each id:
arrange(desc(code)) %>% # Bs will come after Cs
distinct(id, code_day, .keep_all = TRUE)

Extracting corresponding dataframe values from multiple records using a function

I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some of the data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record:
library(dplyr)
getInfo <- function(race_df) {
  # Longest distance per record and trial
  race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.distance = max(End - Start)))
  # Largest Number per record and trial
  race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.N = max(Number)))
  # Combine both summaries and order by trial
  rd_rmn_merge <- merge(x = race_distance, y = race_max_number)
  total_summary <- rd_rmn_merge[order(rd_rmn_merge$Trial), ]
  return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to find the longest distance together with its corresponding 'Number', regardless of whether that Number is the maximum. So Record 1, Trial 2 would look like this instead:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to be able to create a function that is able to take arguments 'Record' and 'Trial' through the 'race_df' dataframe to make searching for a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame( Record = c(1,1,1,2),
Trial = c(2,2,3,1),
Start = c(1,4,1,1),
End = c(4,6,3,5),
Speed = c(12,11,10,14),
Number = c(9,10,17,5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
mutate( Max.Distance = End - Start) %>%
select(-Start,-End,-Speed) %>%
group_by(Record) %>%
nest() %>%
mutate( data = map( data, ~ filter(.x, Max.Distance == max(Max.Distance)) )) %>%
unnest(data)
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note that if you want to keep all of your columns in the final data frame, just remove the select() line.
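The pipeline above groups by Record only. If the longest distance is wanted for every Record and Trial combination, as in the question's example, one possible variant (a sketch, not the answerer's code; per_trial is an illustrative name) is:

```r
library(dplyr)

df <- data.frame(Record = c(1, 1, 1, 2),
                 Trial = c(2, 2, 3, 1),
                 Start = c(1, 4, 1, 1),
                 End = c(4, 6, 3, 5),
                 Speed = c(12, 11, 10, 14),
                 Number = c(9, 10, 17, 5))

per_trial <- df %>%
  mutate(Distance = End - Start) %>%
  group_by(Record, Trial) %>%
  # Keep the row(s) with the longest distance in each record/trial pair;
  # ties are all kept.
  filter(Distance == max(Distance)) %>%
  ungroup() %>%
  select(Record, Trial, Distance, Number)
per_trial
```

For Record 1, Trial 2 this keeps the row with Distance 3 and Number 9, rather than the maximum Number of 10.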
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) where we have the maximum distance, right?
So, it boils down to two filters:
1) Filter the rows for the record and trial.
2) Within that subset, filter the row that has the maximum distance.
Between those two filters, we have to calculate the distance although I suggest you move that outside the function because it is basically a one time operation.
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5), Speed = c(12, 11, 10, 14),
Number = c(9, 10, 17, 5))
get_longest <- function(df, record, trial){
df %>%
filter(Record == record & Trial == trial) %>%
mutate(Distance = End - Start) %>%
filter(Distance == max(Distance)) %>%
select(Number, Distance)
}
get_longest(race_df, 1, 2)
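Following the suggestion to compute the distance once outside the function, a base R sketch could look like this. get_longest_precomputed is a hypothetical name, and the function assumes the Distance column already exists.

```r
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14),
                      Number = c(9, 10, 17, 5))

# One-time operation: compute the distance for every row.
race_df$Distance <- race_df$End - race_df$Start

# Hypothetical helper: assumes race_df already has a Distance column.
get_longest_precomputed <- function(df, record, trial) {
  sub <- df[df$Record == record & df$Trial == trial, ]
  sub[sub$Distance == max(sub$Distance), c("Number", "Distance")]
}

longest <- get_longest_precomputed(race_df, 1, 2)
longest   # Number 9, Distance 3
```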

Resources