Group Data in R for consecutive rows

If there's not a quick 1-3 liner for this in R, I'll definitely just use Linux sort and a short Python program using groupby, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  weight    = c(150, 150, 151, 150, 150, 170, 170, 170, 171, 171),
  start_day = c(1, 4, 7, 10, 11, 5, 10, 15, 20, 25),
  end_day   = c(4, 7, 10, 11, 30, 10, 15, 20, 25, 30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID and weight, but only when the group is in consecutive rows of df_in. Specifically, the desired output is
df_desired_out <- data.frame(
  ID      = c(1, 1, 1, 2, 2),
  weight  = c(150, 151, 150, 170, 171),
  min_day = c(1, 7, 10, 5, 20),
  max_day = c(7, 10, 30, 20, 30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.

In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below. An example of this is also here.
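For example, evaluated on df_in$weight, the expression labels each run of equal consecutive values (a quick illustration, my addition, assuming df_in as defined above):
cumsum(c(1, diff(df_in$weight) != 0))
#> [1] 1 1 2 3 3 4 4 4 5 5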
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
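For example (a sketch of that cleanup, my addition):
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day)) %>%
  ungroup() %>%
  select(-group_weight)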

First we combine ID and weight. The quick-and-dirty way is paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep = '_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
A safer way is to use interaction() or group_indices() (see: Combine values in 4 columns to a single unique value).
We can group consecutively using rle.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), rlel)  # one id per run, repeated to the run's length
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
df_in %>%
  group_by(group) %>%
  summarize(id_weight = id_weight[1],
            start_day = min(start_day),
            end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
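As an aside, the run id can also be built in one step with data.table::rleid(), which accepts several vectors at once (a sketch, my addition, assuming data.table is installed):
library(data.table)
df_in$group <- rleid(df_in$ID, df_in$weight)  # new id whenever (ID, weight) changes between consecutive rows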

A base R option is aggregate(). Note that this appears to target an earlier form of the question's data (df_in has no day column), and grouping by ID and weight alone ignores whether rows are consecutive, so it does not reproduce the desired output above:
with(df_in, {
  aggregate(day, list('ID' = ID, 'weight' = weight),
            function(x) c('min_day' = min(x), 'max_day' = max(x)))
})
Produces:
  ID weight x.min_day x.max_day
1  1    150         1         5
2  1    151         3         3
3  2    170         1         3
4  2    171         4         5
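If you do want a base R version that respects consecutive runs, the same cumsum trick from the dplyr answer can drive a split/apply summary (a sketch, my addition, not part of the original answer):
run <- cumsum(c(TRUE, diff(df_in$ID) != 0 | diff(df_in$weight) != 0))
out <- do.call(rbind, lapply(split(df_in, run), function(d)
  data.frame(ID = d$ID[1], weight = d$weight[1],
             min_day = min(d$start_day), max_day = max(d$end_day))))
out  # matches df_desired_out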

Related

R - Using grouping function inside a loop

I have a set of data with group ids. Inside each group, I have to do a calculation for the first observation and then sequentially calculate the remaining observations in that group. I have the following data:
library(tidyverse)
df <- tibble(id = c(1:10),
             group_id = c(rep(1, 5), rep(6, 2), rep(8, 3)),
             value1 = c(100, 200, 300, 400, 500, 250, 350, 20, 25, 45),
             value2 = c(50, 75, 150, 175, 200, 15, 25, 78, 99, 101))
df <- df %>%
  group_by(group_id) %>%
  mutate(position = 1:n())  # creating a position id
# A tibble: 6 x 5
# Groups: group_id [2]
id group_id value1 value2 position
<int> <dbl> <dbl> <dbl> <int>
1 1 1 100 50 1
2 2 1 200 75 2
3 3 1 300 150 3
4 4 1 400 175 4
5 5 1 500 200 5
6 6 6 250 15 1
I would like to create an "aggregation" column that uses value1, value2 and, most importantly, the previous values of the aggregation column itself.
I first calculate the first step:
df <- df %>%
  mutate(aggregation = ifelse(position == 1, value1 * value2, 0))
# A tibble: 10 x 6
# Groups: group_id [3]
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 0
3 3 1 300 150 3 0
4 4 1 400 175 4 0
5 5 1 500 200 5 0
6 6 6 250 15 1 3750
7 7 6 350 25 2 0
8 8 8 20 78 1 1560
9 9 8 25 99 2 0
10 10 8 45 101 3 0
Then I use a loop with the condition that, where the aggregation column is 0 (everything not calculated in the previous step), the value becomes value1 * value2 divided by the previous aggregation value:
for (i in 1:nrow(df)) {
  df$aggregation[i] <- ifelse(df$aggregation[i] == 0,
                              round((df$value1[i] * df$value2[i]) / lag(df$aggregation)[i], 0),
                              df$aggregation[i])
}
# A tibble: 10 x 6
# Groups: group_id [3]
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 3
3 3 1 300 150 3 15000
4 4 1 400 175 4 5
5 5 1 500 200 5 20000
6 6 6 250 15 1 3750
7 7 6 350 25 2 2
8 8 8 20 78 1 1560
9 9 8 25 99 2 2
10 10 8 45 101 3 2272
I was wondering if there was a better way to do this. I like to use dplyr, but so far, due to the necessity to calculate the values one after the other, I've been unable to find a good solution.
Most importantly, however, instead of the condition I use in the last portion, I would have liked to do the following:
df %>%
  group_by(group_id) %>%
  mutate(aggregation = case_when(
    group_id != 1 ~ value1 * value2 / lag(aggregation),
    TRUE ~ aggregation
  ))
However, this doesn't work in a loop. I've generally been unable to use dplyr inside a loop, especially since once group_by() is used, I feel like mutate() would be my only option, but it seems to create a conflict with the functionality of the loop itself.
Perhaps you could use accumulate2 from purrr, which I think may be what you're looking for.
There are 3 arguments needed for accumulate2:
For accumulate2(), a 3-argument function. The function will be passed
the accumulated result as the first argument. The next value in
sequence from .x is passed as the second argument. The next value in
sequence from .y is passed as the third argument.
So, ..1 would be the accumulated result (the previous row aggregation value), ..2 would be value1, and ..3 would be value2.
library(tidyverse)
df %>%
  group_by(group_id) %>%
  mutate(aggregation = accumulate2(value1[-1], value2[-1],
                                   ~ round(..2 * ..3 / ..1),
                                   .init = value1[1] * value2[1]) %>% flatten_dbl)
Output
id group_id value1 value2 position aggregation
<int> <dbl> <dbl> <dbl> <int> <dbl>
1 1 1 100 50 1 5000
2 2 1 200 75 2 3
3 3 1 300 150 3 15000
4 4 1 400 175 4 5
5 5 1 500 200 5 20000
6 6 6 250 15 1 3750
7 7 6 350 25 2 2
8 8 8 20 78 1 1560
9 9 8 25 99 2 2
10 10 8 45 101 3 2272
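For intuition about how accumulate2 threads state, here is a toy call (my addition, unrelated to the data above): starting from .init = 1, each step computes ..1 + ..2 * ..3 with ..1 the previous result.
purrr::accumulate2(2:3, c(10, 100), ~ ..1 + ..2 * ..3, .init = 1)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 21
#>
#> [[3]]
#> [1] 321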

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each = 8),
                   visit = rep(1:2, each = 4, 5),
                   trial = rep(1:4, 10),
                   var1 = sample(1:50, 40, replace = TRUE)))
This is pretend data representing 5 patients who each had 2 study visits, with 4 trials per visit and a measurement (var1) taken at each trial. (Note the data frame has 40 rows, so var1 needs 40 draws; since sample() is used without a seed, your values will differ from those below.)
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),
                        function(x) sort(df1$var1[seq(x)], decreasing = TRUE)[2])
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(df1$var1),
                            function(x) sort(df1$var1[seq(x)], decreasing = TRUE)[2]))
But I get an error: Error: Problem with mutate() input cum2ndmax. Input cum2ndmax can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
  double max_value = R_NegInf, max_value2 = NA_REAL;
  NumericVector result(x.length());
  for (int i = 0; i < x.length(); ++i) {
    if (x[i] > max_value) {
      max_value2 = max_value;  // the old maximum becomes the runner-up
      max_value = x[i];
    } else if (x[i] < max_value && x[i] > max_value2) {
      max_value2 = x[i];       // new runner-up strictly between the two
    }
    // max_value2 is still -Inf until two distinct values have been seen
    result[i] = isinf(max_value2) ? NA_REAL : max_value2;
  }
  return result;
}
")
df1 %>%
  group_by(patient, visit) %>%
  mutate(c2max = cum_second_max(var1))
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help! In the end, I used an approach similar to the one tmfmnk suggested, since I was already using dplyr. Interestingly, the code as posted gave me a column of values that just repeated the first row's number; with a small tweak changing dense_rank to order, I got exactly what I wanted:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
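For reference, the original sort() idea from the question also works inside the grouped mutate once the df1$ prefixes are dropped, so it operates on each group's slice rather than the whole column (a sketch, my addition):
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(var1),
                            function(x) sort(var1[seq(x)], decreasing = TRUE)[2]))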

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is the lagged difference from the within-group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                      group = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                      score = as.numeric(c(0, 150, 170, 80, 100, 110, 75, 100, 0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of #Liam's answer:
library(data.table)
setDT(my_data)
my_data[, .(group, score, lag_diff_baseline = score - first(score)), by = id]  # by = id already returns the id column
I missed the easy answer:
library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = score - first(score))
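One caveat with both versions (my addition): score - first(score) yields 0, not the NA shown in the desired output, in each group's first row. If the NA matters, one option is:
my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = ifelse(row_number() == 1, NA, score - first(score)))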

Fill zeros with previous value + 1

I have records grouped by user. The variable "day" contains some 0s, which I would like to replace in sequence (= previous value + 1).
data <- data.frame(user = c(1,1,1,2,2,2,2,2), day = c(170,0,172,34,35,0,0,38))
data
user day
1 1 170
2 1 0
3 1 172
4 2 34
5 2 35
6 2 0
7 2 0
8 2 38
I want to have the following:
data_new
user day
1 1 170
2 1 171
3 1 172
4 2 34
5 2 35
6 2 36
7 2 37
8 2 38
I've tried the following (really inefficient, and it doesn't work for all cases...):
data <- group_by(data, user) %>%
  mutate(lead_day = lead(day),
         day_new = case_when(day == 0 ~ lead_day - 1,
                             day > 0 ~ day))
data
# A tibble: 8 x 4
# Groups: user [2]
user day lead_day day_new
<dbl> <dbl> <dbl> <dbl>
1 1 170 0 170
2 1 0 172 171
3 1 172 NA 172
4 2 34 35 34
5 2 35 0 35
6 2 0 0 -1
7 2 0 38 37
8 2 38 NA 38
You could use Reduce
data$day <- Reduce(function(x, y) if (y == 0) x + 1 else y, data$day, accumulate = TRUE)
data
# user day
# 1 1 170
# 2 1 171
# 3 1 172
# 4 2 34
# 5 2 35
# 6 2 36
# 7 2 37
# 8 2 38
Or as you use tidyverse already :
data %>% mutate(day = accumulate(day,~if(.y==0) .x+1 else .y))
# user day
# 1 1 170
# 2 1 171
# 3 1 172
# 4 2 34
# 5 2 35
# 6 2 36
# 7 2 37
# 8 2 38
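Note that both versions run over the whole day column without regard to user; that is fine here, but if a user's first day could be 0 it would continue counting from the previous user's last value. To be safe, group first (a sketch, my addition):
data %>%
  group_by(user) %>%
  mutate(day = accumulate(day, ~ if (.y == 0) .x + 1 else .y)) %>%
  ungroup()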

R Dplyr; calculating difference between two columns from previous row but putting result in next row without for loop

I am trying to calculate, using dplyr in R and preferably without a loop, the difference between two columns of the previous row and place the result in the current row. In this specific example, I want to compute r_j - s_j from the previous row and put the result in the next row.
Here is some sample data:
require(tidyverse)
data <- tibble(LM = c(100, 300, 400, 500, 600, 700, 800, 1300),
               s_j = c(2, 2, 2, 1, 2, 2, 1, 1)) %>%
  bind_cols(r_j = rep(25, nrow(.)))
LM s_j r_j
1 100 2 25
2 300 2 25
3 400 2 25
4 500 1 25
5 600 2 25
6 700 2 25
7 800 1 25
8 1300 1 25
My desired output is this;
LM s_j r_j
1 100 2 25
2 300 2 23
3 400 2 21
4 500 1 19
5 600 2 18
6 700 2 16
7 800 1 14
8 1300 1 13
A solution to this problem is:
for (k in 2:nrow(data)) {
  tmp <- data$r_j[k - 1] - data$s_j[k - 1]
  data$r_j[k] <- tmp
}
which yields
LM s_j r_j
1 100 2 25
2 300 2 23
3 400 2 21
4 500 1 19
5 600 2 18
6 700 2 16
7 800 1 14
8 1300 1 13
but surely there exists a much better solution than the for loop in R? Thanks for any help.
One way is to generate the cumulative sum of s_j and then subtract that from r_j
data %>%
  mutate(temp = cumsum(s_j),
         r_j2 = r_j - temp)
# A tibble: 8 x 5
LM s_j r_j temp r_j2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 100 2 25 2 23
2 300 2 25 4 21
3 400 2 25 6 19
4 500 1 25 7 18
5 600 2 25 9 16
6 700 2 25 11 14
7 800 1 25 12 13
8 1300 1 25 13 12
EDIT: To generate the exact output desired one can subtract the value of s_j from its cumsum and get the following:
data %>%
  mutate(temp = cumsum(s_j) - s_j,
         r_j2 = r_j - temp)
# A tibble: 8 x 5
LM s_j r_j temp r_j2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 100 2 25 0 25
2 300 2 25 2 23
3 400 2 25 4 21
4 500 1 25 6 19
5 600 2 25 7 18
6 700 2 25 9 16
7 800 1 25 11 14
8 1300 1 25 12 13
EDIT2: Including the solution by IceCreamToucan which does not need to generate a temp column:
data %>%
  mutate(r_j2 = coalesce(lag(r_j - cumsum(s_j)), r_j))
# A tibble: 8 x 4
LM s_j r_j r_j2
<dbl> <dbl> <dbl> <dbl>
1 100 2 25 25
2 300 2 25 23
3 400 2 25 21
4 500 1 25 19
5 600 2 25 18
6 700 2 25 16
7 800 1 25 14
8 1300 1 25 13
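Why these agree: r_j is constant at 25, so row k should hold 25 minus the sum of s_j over rows 1..k-1. lag(r_j - cumsum(s_j)) expresses exactly that, with coalesce() restoring the untouched first row. A quick sanity check (my addition):
data %>%
  mutate(v1 = r_j - (cumsum(s_j) - s_j),
         v2 = coalesce(lag(r_j - cumsum(s_j)), r_j)) %>%
  summarise(all_equal = all(v1 == v2))
#> # A tibble: 1 x 1
#>   all_equal
#>   <lgl>
#> 1 TRUE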
