R - Using grouping function inside a loop

I have a set of data with group ids. Inside each group I have to do a calculation for the first observation, and then subsequently and sequentially calculate the remaining n observations in that group. I have the following data:
library(tidyverse)
df <- tibble(id = 1:10,
             group_id = c(rep(1, 5), rep(6, 2), rep(8, 3)),
             value1 = c(100, 200, 300, 400, 500, 250, 350, 20, 25, 45),
             value2 = c(50, 75, 150, 175, 200, 15, 25, 78, 99, 101))
df <- df %>%
  group_by(group_id) %>%
  mutate(position = 1:n())  # creating a position id within each group
# A tibble: 6 x 5
# Groups: group_id [2]
id group_id value1 value2 position
<int> <dbl> <dbl> <dbl> <int>
1 1 1 100 50 1
2 2 1 200 75 2
3 3 1 300 150 3
4 4 1 400 175 4
5 5 1 500 200 5
6 6 6 250 15 1
I would like to create an "aggregation" column that uses value1, value2 and, most importantly, its own previous values.
I first calculate the first step:
df <- df %>%
  mutate(aggregation = ifelse(position == 1, value1 * value2, 0))
# A tibble: 10 x 6
# Groups: group_id [3]
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 0
3 3 1 300 150 3 0
4 4 1 400 175 4 0
5 5 1 500 200 5 0
6 6 6 250 15 1 3750
7 7 6 350 25 2 0
8 8 8 20 78 1 1560
9 9 8 25 99 2 0
10 10 8 45 101 3 0
Then I use a loop with the condition that, if the aggregation column still holds a 0 (everything not calculated in the previous step), I compute value1 * value2 divided by the previous aggregation value:
for (i in 1:nrow(df)) {
  df$aggregation[i] <- ifelse(df$aggregation[i] == 0,
                              round((df$value1[i] * df$value2[i]) / lag(df$aggregation)[i], 0),
                              df$aggregation[i])
}
# A tibble: 10 x 6
# Groups: group_id [3]
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 3
3 3 1 300 150 3 15000
4 4 1 400 175 4 5
5 5 1 500 200 5 20000
6 6 6 250 15 1 3750
7 7 6 350 25 2 2
8 8 8 20 78 1 1560
9 9 8 25 99 2 2
10 10 8 45 101 3 2272
I was wondering if there was a better way to do this. I like to use dplyr, but so far, due to the necessity to calculate the values one after the other, I've been unable to find a good solution.
Most importantly, however, instead of the condition I use in the last portion, I would have liked to do the following:
df %>%
  group_by(group_id) %>%
  mutate(aggregation = case_when(
    position != 1 ~ value1 * value2 / lag(aggregation),
    TRUE ~ aggregation
  ))
However, this doesn't work: mutate() evaluates the whole expression at once, so lag(aggregation) refers to the original column rather than to values computed earlier in the same call. I've generally been unable to use dplyr inside a loop; once group_by() is used, mutate() seems to be my only option, but it conflicts with the row-by-row nature of the loop.

Perhaps you could use accumulate2 from purrr, which I think may be what you're looking for.
accumulate2() requires a three-argument function. From the purrr documentation:
For accumulate2(), a 3-argument function. The function will be passed the accumulated result as the first argument. The next value in sequence from .x is passed as the second argument. The next value in sequence from .y is passed as the third argument.
So, ..1 would be the accumulated result (the previous row aggregation value), ..2 would be value1, and ..3 would be value2.
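As a minimal illustration with toy numbers (not the question's data), here is how the three arguments flow through accumulate2:
library(purrr)
# ..1 is the running result, ..2 the next element of .x, ..3 the next of .y
accumulate2(c(2, 3), c(10, 20), ~ ..1 + ..2 * ..3, .init = 1)
# returns list(1, 21, 81): 1 is .init, then 1 + 2*10 = 21, then 21 + 3*20 = 81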
library(tidyverse)
df %>%
  group_by(group_id) %>%
  mutate(aggregation = accumulate2(value1[-1], value2[-1],
                                   ~ round(..2 * ..3 / ..1),
                                   .init = value1[1] * value2[1]) %>%
           flatten_dbl)
Output
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 3
3 3 1 300 150 3 15000
4 4 1 400 175 4 5
5 5 1 500 200 5 20000
6 6 6 250 15 1 3750
7 7 6 350 25 2 2
8 8 8 20 78 1 1560
9 9 8 25 99 2 2
10 10 8 45 101 3 2272
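One caveat, assuming you are on a recent purrr (>= 1.0.0): flatten_dbl() is deprecated there, and base unlist() (or purrr::list_c()) does the same job in this pipeline, e.g.:
df %>%
  group_by(group_id) %>%
  mutate(aggregation = unlist(accumulate2(value1[-1], value2[-1],
                                          ~ round(..2 * ..3 / ..1),
                                          .init = value1[1] * value2[1])))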

Find discontinuities in observational data with R

Data
id<-c("a","a","a","a","a","a","b","b","b","b","b","b")
d<-c(1,2,3,90,98,100000,4,6,7,8,23,45)
df<-data.frame(id,d)
I want to detect observational discontinuities within each "id". My expected result is to obtain a way to detect discontinuities without using means or medians as a reference.
You can check whether the difference between a row and the next one within each group is different than 1:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(dis = +(c(FALSE, diff(d) != 1)))
# A tibble: 12 × 3
# Groups: id [2]
id d dis
<chr> <dbl> <int>
1 a 1 0
2 a 2 0
3 a 3 0
4 a 90 1
5 a 98 1
6 a 100000 1
7 b 4 0
8 b 6 1
9 b 7 0
10 b 8 0
11 b 23 1
12 b 45 1
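The unary + in the mutate() call is just a compact way to coerce the logical flags to 0/1 integers, for example:
+(c(FALSE, TRUE, TRUE))
# [1] 0 1 1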

reset a ranking when a variable exceeds a value using dplyr

Suppose I have the following data:
df <- tibble(ID=c(1,2,3,4,5,6,7,8,9,10),
ID2=c(1,1,1,1,2,2,2,3,4,4),
VAR=c(25,10,120,60,85,90,20,40,60,150))
I want to add a new column with a ranking that would be reset either when the ID2 changes or when VAR is greater than 100.
The desired result is:
# A tibble: 10 x 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <dbl>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1
I know how to add a new column with a ranking that would be reset only when the ID2 changes:
df %>%
  arrange(ID2) %>%
  group_by(ID2) %>%
  mutate(RANK = row_number())
... but treating both conditions at the same time is more difficult. How should I do this using dplyr?
You can group_by ID2 and cumsum(VAR > 100), i.e.:
library(dplyr)
df %>%
  group_by(ID2, cumVAR = cumsum(VAR > 100)) %>%
  mutate(RANK = row_number())
Output
# A tibble: 10 x 5
# Groups: ID2, cumVAR [6]
ID ID2 VAR cumVAR RANK
<dbl> <dbl> <dbl> <int> <int>
1 1 1 25 0 1
2 2 1 10 0 2
3 3 1 120 1 1
4 4 1 60 1 2
5 5 2 85 1 1
6 6 2 90 1 2
7 7 2 20 1 3
8 8 3 40 1 1
9 9 4 60 1 1
10 10 4 150 2 1
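If you don't want to keep the helper column in the result, a sketch of the cleanup step is to ungroup and drop it afterwards:
df %>%
  group_by(ID2, cumVAR = cumsum(VAR > 100)) %>%
  mutate(RANK = row_number()) %>%
  ungroup() %>%
  select(-cumVAR)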
rowid from data.table would be useful as well
library(dplyr)
library(data.table)
df %>%
  mutate(RANK = rowid(ID2, cumsum(VAR > 100)))
Output
# A tibble: 10 × 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <int>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1
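If rowid() is unfamiliar: it numbers rows within each combination of its arguments, in order of appearance. A toy illustration (values chosen only for demonstration):
library(data.table)
rowid(c("a", "a", "b", "a"))
# [1] 1 2 1 3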

Ignore zeros and NAs in cumsum

I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find the sum of the values in the z column that correspond to the first group of consecutive numbers in each column.
My data looks something like this:
library(dplyr)
y1 = c(1,2,3,8,9,0)
y2 = c(0,0,0,4,5,6)
z = c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
# A tibble: 6 × 3
y1 y2 z
<dbl> <dbl> <dbl>
1 1 0 200
2 2 0 250
3 3 0 200
4 8 4 100
5 9 5 90
6 0 6 80
I tried the following formula:
yabc %>%
  mutate_at(vars(starts_with("y")),
            list(mod = ~ cumsum(c(FALSE, diff(.x) != 1)) + 1))
that gave me the following result:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 1
2 2 0 250 1 2
3 3 0 200 1 3
4 8 4 100 2 4
5 9 5 90 2 4
6 0 6 80 3 4
I am only interested in numbers greater than zero. I tried replacing zeros with NA, but it did not work either.
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 200 1 1
2 2 NA 250 1 NA
3 3 NA 200 1 NA
4 8 4 100 2 NA
5 9 5 90 2 NA
6 NA 6 80 NA NA
What I would like the data to look like is:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
Is there any way to exclude zeros and start applying the formula only when .x is greater than 0? Or any other way to make the formula work the way I need? Thank you!
FYI: mutate_at() has been superseded by across(); I'll demonstrate the newer idiom in my code.
yabc %>%
  mutate(
    across(starts_with("y"),
           list(mod = ~ if_else(.x > 0,
                                cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) + 1L,
                                NA_integer_)))
  )
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 2
# 5 9 5 90 2 2
# 6 0 6 80 NA 2
If this is sufficient (you don't care if it's 1 or 2 for the first effective group in y2_mod), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
  mutate(
    across(starts_with("y"),
           list(mod = ~ if_else(.x > 0,
                                cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
                                NA_integer_))),
    across(ends_with("_mod"),
           ~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x)))))
  )
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 1
# 5 9 5 90 2 1
# 6 0 6 80 NA 1
Notes:
if_else() is helpful to handle the NA rows specially, but it requires both branches to have the same class, which can be confusing. Because of this, we need to pass the matching "class" of NA as the false= (third) argument. For example, cumsum(.) + 1 produces a double, so the third argument would need to be NA_real_ (the default NA is actually logical). Alternatives are to use cumsum(.) + 1L (an integer) with NA_integer_, or (as in my second example) cumsum(.) by itself with NA_integer_, since match() later returns integers anyway.
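To make the class strictness concrete, a minimal sketch (this is the behavior of if_else() before dplyr 1.1.0; newer versions are more permissive about mixing integer and double):
library(dplyr)
x <- c(1L, 2L)
if_else(x > 1, x, NA_integer_)  # fine: both branches are integer
# if_else(x > 1, x, NA_real_)   # pre-1.1.0 this errors: integer vs double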
I also demo the shift from your mutate_at() to mutate(across(..)). An important change from a plain mutate() is that across() is run without assigning its result to a name: it returns a named list where each element is an updated or new column, depending on .names. That argument takes a glue-like string for renaming the calculated columns, which adds new columns instead of the default action (no .names) of overwriting columns in place. The alternative way of producing new (not in-place) columns is the one you used, a named list of functions, which is still a common and supported pattern within across(..).
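A quick sketch of the .names behavior on this data (the transformation is a toy, just to show the naming):
yabc %>%
  mutate(across(starts_with("y"), ~ .x * 10, .names = "{.col}_x10"))
# adds y1_x10 and y2_x10 instead of overwriting y1 and y2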
library(data.table)
library(tidyverse)
yabc %>%
  mutate(across(starts_with('y'),
                ~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
                .names = '{col}_mod'))
# A tibble: 6 x 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <int> <int>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
The trick lies in knowing that, for consecutive numbers, the difference between each number and its row_number() is constant. For example, consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, the consecutive numbers are grouped together. To turn these into group ids, we can use rleid() from data.table:
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
Another possible solution:
library(tidyverse)
y1 <- c(1,2,3,8,9,0)
y2 <- c(0,0,0,4,5,6)
z <- c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
yabc %>%
  mutate(across(starts_with("y"),
                ~ if_else(.x == 0, NA_real_, 1 + cumsum(c(1, diff(.x)) != 1)),
                .names = "{.col}_mod")) %>%
  mutate(across(ends_with("mod"), ~ as.numeric(factor(.x))))
#> # A tibble: 6 × 5
#> y1 y2 z y1_mod y2_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 200 1 NA
#> 2 2 0 250 1 NA
#> 3 3 0 200 1 NA
#> 4 8 4 100 2 1
#> 5 9 5 90 2 1
#> 6 0 6 80 NA 1

Optimize computation in dplyr mutate function

Assume the following table:
library(dplyr)
library(tibble)
library(purrr)
df = tibble(
  client = c(1,1,1,1,2,2,2,2),
  prod_type = c(1,1,2,2,1,1,2,2),
  max_prod_type = c(2,2,2,2,2,2,2,2),
  value_1 = c(10,20,30,30,100,200,300,300),
  value_2 = c(1,2,3,3,1,2,3,3)
)
# A tibble: 8 x 5
client prod_type max_prod_type value_1 value_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1
2 1 1 2 20 2
3 1 2 2 30 3
4 1 2 2 30 3
5 2 1 2 100 1
6 2 1 2 200 2
7 2 2 2 300 3
8 2 2 2 300 3
Column 'max_prod_type' here denotes the maximum value of 'prod_type' per 'client'. I need to compute a new column 'sum', containing the sum of the values from 'value_1' and 'value_2', but only over those rows where 'prod_type' == 'max_prod_type' within each 'client'.
I have tried following code:
df %>%
  mutate(
    sum = map2_dbl(
      client, max_prod_type,
      ~ case_when(
        prod_type == .y ~ filter(df, client == .x, prod_type == .y) %>%
          mutate(sum = value_1 + value_2) %>%
          select(sum) %>%
          sum(),
        TRUE ~ NA_real_
      )
    )
  )
Desired output is following:
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
But it throws an error:
Error: Problem with `mutate()` input `sum`.
x Result 1 must be a single double, not a double vector of length 6
i Input `sum` is `map2_dbl(...)`.
Moreover, this implementation seems rather slow to me. I'm wondering if there is a correct and better-optimized solution to this problem. I'd appreciate your help!
One option could be:
df %>%
  group_by(client) %>%
  mutate(res = row_number() == which(value_1 == max(value_1)),
         res = if_else(res, sum(value_1[res]) + sum(value_2[res]), NA_real_))
client prod_type max_prod_type value_1 value_2 res
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
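In case prod_type and value_1 can disagree about which rows are the group maximum, a variant keyed directly on the stated condition prod_type == max_prod_type might be safer (a sketch; it gives the same output on this data):
df %>%
  group_by(client) %>%
  mutate(sum = if_else(prod_type == max_prod_type,
                       sum((value_1 + value_2)[prod_type == max_prod_type]),
                       NA_real_))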
I think this is closer to what you want:
df %>%
  mutate(sum = case_when(prod_type == max_prod_type ~ value_1 + value_2,
                         TRUE ~ NA_real_))
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 33
4 1 2 2 30 3 33
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 303
8 2 2 2 300 3 303

Group Data in R for consecutive rows

If there's not a quick 1-3 liner for this in R, I'll definitely just use linux sort and a short python program using groupby, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2),
  weight = c(150,150,151,150,150,170,170,170,171,171),
  start_day = c(1,4,7,10,11,5,10,15,20,25),
  end_day = c(4,7,10,11,30,10,15,20,25,30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID and weight, but only when the group is in consecutive rows of df_in. Specifically, the desired output is
df_desired_out <- data.frame(
  ID = c(1,1,1,2,2),
  weight = c(150,151,150,170,171),
  min_day = c(1,7,10,5,20),
  max_day = c(7,10,30,20,30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below. An example of this is also here.
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
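For completeness, a sketch of that cleanup:
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day)) %>%
  ungroup() %>%
  select(-group_weight)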
First we combine ID and weight. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep = '_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
A safer way is to use interaction() or group_indices(); see: Combine values in 4 columns to a single unique value.
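For example, with interaction() (a sketch; group_indices() would be the dplyr analogue):
df_in$id_weight <- as.character(interaction(df_in$ID, df_in$weight, sep = "_"))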
We can then number the consecutive runs using rle():
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), times = rlel)
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
df_in %>%
  group_by(group) %>%
  summarize(id_weight = id_weight[1],
            start_day = min(start_day),
            end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
A base R option is aggregate(). Note that this groups by ID and weight only, so non-consecutive runs of the same weight are collapsed, and it therefore does not reproduce the desired split for ID 1 / weight 150:
merge(
  aggregate(start_day ~ ID + weight, df_in, min),
  aggregate(end_day ~ ID + weight, df_in, max)
)
Produces:
  ID weight start_day end_day
1  1    150         1      30
2  1    151         7      10
3  2    170         5      20
4  2    171        20      30
