grouped event chain ID in tidyverse - r

I'm attempting to create an ID column for my data frame that counts a sequence of events and can't figure out where I'm going wrong.
The data looks like this:
data
library(tidyverse)
df <- tribble(
  ~group, ~value,
  "a", 4,
  "a", 3,
  "a", 10,
  "b", 2,
  "b", 4,
  "a", 20,
  "a", 14,
  "a", 12,
  "a", 9,
  "b", 66,
  "b", 23,
  "b", 48
)
Things I've tried...
I tried cur_group_id(), but that only returns a single integer identifying each group (1 for "a", 2 for "b") rather than counting the chains of events:
df %>%
  group_by(group) %>%
  mutate(ID = cur_group_id()) %>%
  as.data.frame()
group value ID
1 a 4 1
2 a 3 1
3 a 10 1
4 b 2 1
5 b 4 1
6 a 20 1
7 a 14 1
8 a 12 1
9 a 9 1
10 b 66 2
11 b 23 2
12 b 48 2
I've also tried seq_along(), which gets me a bit closer to what I want, but it is just a running count of the rows within each group, like row_number():
df %>%
  group_by(group) %>%
  mutate(ID = seq_along(group)) %>%
  as.data.frame()
group value ID
1 a 4 1
2 a 3 2
3 a 10 3
4 b 2 1
5 b 4 2
6 a 20 4
7 a 14 5
8 a 12 6
9 a 9 7
10 b 66 3
11 b 23 4
12 b 48 5
My desired output
What I'd really like it to look like is this:
df$expectedID <- c(1,1,1,1,1,2,2,2,2,2,2,2)
# A tibble: 12 x 3
group value expectedID
<chr> <dbl> <dbl>
1 a 4 1
2 a 3 1
3 a 10 1
4 b 2 1
5 b 4 1
6 a 20 2
7 a 14 2
8 a 12 2
9 a 9 2
10 b 66 2
11 b 23 2
12 b 48 2
Basically, if the lagged group is the same as the current group, retain the count. If the lagged group is different than the current group, begin a new count. Each time the group changes, increase the count by one.

Here is one option, (ab)using rle() with data.table::rowid():
df$id <- rle(df$group) %>%
  # rowid() numbers each run within its value (1st "a" run -> 1, 2nd "a" run -> 2, ...),
  # then rep() expands those run numbers back out to one per row
  {rep(data.table::rowid(.$values), times = .$lengths)}
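If you prefer to stay in dplyr, here is a sketch of the same idea (it assumes dplyr >= 1.1.0 for consecutive_id(); the helper column run is just for illustration): number the runs of group first, then turn each group's run numbers into 1, 2, ... with match().
library(dplyr)

df %>%
  mutate(run = consecutive_id(group)) %>%   # new run number each time `group` changes
  group_by(group) %>%
  mutate(id = match(run, unique(run))) %>%  # 1st run of this group -> 1, 2nd run -> 2, ...
  ungroup() %>%
  select(-run)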

Related

In R: How to extract a specific (e.g. last) value from a dataframe with multiple rows belonging to one person?

I have a dataframe that contains variables from a collection of follow-up visits of patients after transplantation.
For the sake of simplicity there are only two variables: 1) the patient identification and 2) the number of days from the transplantation up until the follow-up visit.
Each row is a follow-up visit. There can be multiple follow-ups per patient, the number of follow-ups varies between patients, and the days after transplantation at which these visits happen vary as well.
I would like to extract the number of days at the last follow-up of each patient and write it into a separate column in every follow-up observation of that patient.
In the real dataset there are around 15,000 patients. I tried to extract the values from a nested dataframe, but I couldn't get it to work.
Example:
patient_ID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C")
days_tx_followup <- c(0, 5, 10, 15, 2, 4, 1, 2, 3)
df <- data.frame(patient_ID, days_tx_followup)
patient_ID days_tx_followup
1 A 0
2 A 5
3 A 10
4 A 15
5 B 2
6 B 4
7 C 1
8 C 2
9 C 3
What I would like to have:
patient_ID days_tx_followup last_followup
1 A 0 15
2 A 5 15
3 A 10 15
4 A 15 15
5 B 2 4
6 B 4 4
7 C 1 3
8 C 2 3
9 C 3 3
Thankfully, dplyr has a function called last() that does just this.
library(dplyr)

df %>%
  group_by(patient_ID) %>%
  mutate(
    last_followup = last(days_tx_followup)
  )
#> # A tibble: 9 × 3
#> # Groups: patient_ID [3]
#> patient_ID days_tx_followup last_followup
#> <chr> <dbl> <dbl>
#> 1 A 0 15
#> 2 A 5 15
#> 3 A 10 15
#> 4 A 15 15
#> 5 B 2 4
#> 6 B 4 4
#> 7 C 1 3
#> 8 C 2 3
#> 9 C 3 3
Created on 2022-08-23 by the reprex package (v2.0.1)
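One caveat: last() is positional, so if the rows are not guaranteed to be ordered by visit, max() may be the safer way to pick the final follow-up day. A minimal sketch on the same data:
library(dplyr)

df %>%
  group_by(patient_ID) %>%
  mutate(last_followup = max(days_tx_followup, na.rm = TRUE)) %>%
  ungroup()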
Using by and tail.
by(df, df$patient_ID,
   \(x) cbind(x, last_followup = tail(x$days_tx_followup, 1))) |>
  unsplit(df$patient_ID)
# patient_ID days_tx_followup last_followup
# 1 A 0 15
# 2 A 5 15
# 3 A 10 15
# 4 A 15 15
# 5 B 2 4
# 6 B 4 4
# 7 C 1 3
# 8 C 2 3
# 9 C 3 3
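A compact base R alternative (a sketch on the same data) is ave(), which applies a function within each group and recycles a length-one result across the group's rows:
# tail(x, 1) returns the last value; ave() recycles it to the group's length
df$last_followup <- ave(df$days_tx_followup, df$patient_ID,
                        FUN = function(x) tail(x, 1))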

Replace all subsequent column values, after the first instance of a value greater than x

I have a dataframe (df1) with two columns: one (grp) is a grouping variable, and the second (num) holds some measurements.
For each group I want to:
replace all numbers greater than 3.5 with 4
replace all numbers after the first instance of 4 with 4
I really just want to get to step 2; step 1 seemed like a logical starting point, though maybe it isn't required?
Example data
library(dplyr)
df1 <- data.frame(
  grp = rep(c("a", "b"), each = 10),
  num = c(0, 1, 2, 5, 0, 1, 7, 0, 2, 1, 2, 2, 2, 2, 5, 0, 0, 0, 0, 6)
)
I can get the first part:
df1 %>%
  group_by(grp) %>%
  mutate(num = ifelse(num > 3.5, 4, num))
For the second part I tried using dplyr::lag and dplyr::case_when but no luck. Here is the desired output:
grp num
1 a 0
2 a 1
3 a 2
4 a 4
5 a 4
6 a 4
7 a 4
8 a 4
9 a 4
10 a 4
11 b 2
12 b 2
13 b 2
14 b 2
15 b 4
16 b 4
17 b 4
18 b 4
19 b 4
20 b 4
Any advice would be much appreciated.
You could use cumany() to flag every row from the first event (num > 3.5) onward within each group.
library(dplyr)
df1 %>%
  group_by(grp) %>%
  mutate(num2 = replace(num, cumany(num > 3.5), 4)) %>%
  ungroup()
# A tibble: 20 × 3
grp num num2
<chr> <dbl> <dbl>
1 a 0 0
2 a 1 1
3 a 2 2
4 a 5 4
5 a 0 4
6 a 1 4
7 a 7 4
8 a 0 4
9 a 2 4
10 a 1 4
11 b 2 2
12 b 2 2
13 b 2 2
14 b 2 2
15 b 5 4
16 b 0 4
17 b 0 4
18 b 0 4
19 b 0 4
20 b 6 4
You can also replace cumany(num > 3.5) with cumsum(num > 3.5) > 0.
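If you happen to use data.table, the same logic fits in one grouped update. A minimal sketch (it assumes df1 as defined above and converts it to a data.table by reference):
library(data.table)

# flag everything from the first value above 3.5 onward within each grp and overwrite with 4
setDT(df1)[, num2 := replace(num, cumsum(num > 3.5) > 0, 4), by = grp]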

grouping to aggregate values, but tripping up on NA's

I have long data, and I am trying to make a new variable (consistent) that holds, for each person (ID), the value of a given column (VALUE) at TIME = 2. I used the code below to do this, but I am getting tripped up on NAs. If the VALUE at TIME = 2 is NA, then I want it to grab the VALUE at TIME = 1 instead. That part I'm not sure how to do. So, in the example below, I want the new variable (consistent) for person B to be 10 instead of NA.
ID = c("A", "A", "B", "B", "C", "C", "D", "D")
TIME = c(1, 2, 1, 2, 1, 2, 1, 2)
VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
df = data.frame(ID, TIME, VALUE)
library(dplyr)

df <- df %>%
  group_by(ID) %>%
  mutate(consistent = VALUE[TIME == 2]) %>%
  ungroup()
df
If we want to use the same code, then coalesce() with the 'VALUE' where 'TIME' is 1 (assuming there is a single observation per 'TIME' value for each 'ID'):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(consistent = coalesce(VALUE[TIME == 2], VALUE[TIME == 1])) %>%
  ungroup()
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 9
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 13
6 C 2 13 13
7 D 1 14 9
8 D 2 9 9
Or another option is to arrange() before doing the group_by() and get the first() element of 'VALUE' (assuming 'TIME' is not replicated within an 'ID'):
df %>%
  arrange(ID, is.na(VALUE), desc(TIME)) %>%
  group_by(ID) %>%
  mutate(consistent = first(VALUE)) %>%
  ungroup()
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 2 9 9
2 A 1 8 9
3 B 1 10 10
4 B 2 NA 10
5 C 2 13 13
6 C 1 12 13
7 D 2 9 9
8 D 1 14 9
Another possible solution, using tidyr::fill:
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(consistent = VALUE) %>%
  fill(consistent) %>%
  ungroup()
#> # A tibble: 8 × 4
#> ID TIME VALUE consistent
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 8 8
#> 2 A 2 9 9
#> 3 B 1 10 10
#> 4 B 2 NA 10
#> 5 C 1 12 12
#> 6 C 2 13 13
#> 7 D 1 14 14
#> 8 D 2 9 9
You can also use ifelse() with your condition. The lagged row is guaranteed to have TIME 1 here, since each group has exactly two members, one with TIME 1 and one with TIME 2.
df %>%
  group_by(ID) %>%
  arrange(TIME, .by_group = TRUE) %>%
  mutate(consistent = ifelse(is.na(VALUE) & TIME == 2, lag(VALUE), VALUE)) %>%
  ungroup()
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 8
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 12
6 C 2 13 13
7 D 1 14 14
8 D 2 9 9
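Another way to express the same intent is a one-row-per-ID lookup that prefers the TIME == 2 value and falls back to TIME == 1, joined back onto the rows. A sketch (assuming at most one row per ID and TIME; lookup is just an illustrative name):
library(dplyr)

lookup <- df %>%
  group_by(ID) %>%
  summarise(consistent = coalesce(VALUE[TIME == 2][1], VALUE[TIME == 1][1]))

df %>%
  select(ID, TIME, VALUE) %>%   # drop any earlier `consistent` column before joining
  left_join(lookup, by = "ID")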

Count variable until the observation changes [duplicate]

This question already has answers here:
Create counter within consecutive runs of values
(3 answers)
Closed 1 year ago.
Unfortunately, I can't wrap my head around this, but I'm sure there is a straightforward solution. I have a data.frame that looks like this:
set.seed(1)
mydf <- data.frame(group=sample(c("a", "b"), 20, replace=T))
I'd like to create a new variable that counts, from top to bottom, how many times the group has occurred in a row. For the example above it should look like this:
mydf$question <- c(1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1)
> mydf[1:10,]
group question
1 a 1
2 a 2
3 b 1
4 b 2
5 a 1
6 b 1
7 b 2
8 b 3
9 b 4
10 a 1
Thanks for help.
Using data.table::rleid and dplyr you could do:
set.seed(1)
mydf <- data.frame(group=sample(c("a", "b"), 20, replace=T))
library(dplyr)
library(data.table)
mydf %>%
  mutate(id = data.table::rleid(group)) %>%
  group_by(id) %>%
  mutate(question = row_number()) %>%
  ungroup()
#> # A tibble: 20 × 3
#> group id question
#> <chr> <int> <int>
#> 1 a 1 1
#> 2 b 2 1
#> 3 a 3 1
#> 4 a 3 2
#> 5 b 4 1
#> 6 a 5 1
#> 7 a 5 2
#> 8 a 5 3
#> 9 b 6 1
#> 10 b 6 2
#> 11 a 7 1
#> 12 a 7 2
#> 13 a 7 3
#> 14 a 7 4
#> 15 a 7 5
#> 16 b 8 1
#> 17 b 8 2
#> 18 b 8 3
#> 19 b 8 4
#> 20 a 9 1
Update: mostly the same as stefan's answer, but without the data.table package:
library(dplyr)
mydf %>%
  mutate(myrleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
  group_by(myrleid) %>%
  mutate(question = row_number()) %>%
  ungroup()
group myrleid question
<chr> <int> <int>
1 a 1 1
2 b 2 1
3 a 3 1
4 a 3 2
5 b 4 1
6 a 5 1
7 a 5 2
8 a 5 3
9 b 6 1
10 b 6 2
11 a 7 1
12 a 7 2
13 a 7 3
14 a 7 4
15 a 7 5
16 b 8 1
17 b 8 2
18 b 8 3
19 b 8 4
20 a 9 1
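For completeness, the whole counter can also be built in one step with base R, since sequence() expands each run length from rle() into the counts 1, 2, ..., length:
# run lengths of consecutive groups -> 1:length for each run
# (as.character() guards against `group` being a factor, since rle() needs an atomic vector)
mydf$question <- sequence(rle(as.character(mydf$group))$lengths)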

Convert tibble to long form when the variables consist of several parts

I have data like this:
library(tidyverse)
df = tribble(
  ~id, ~a1, ~a2, ~a3, ~b1, ~b2, ~b3, ~c1, ~c2, ~c3,
  1, 1, 4, 7, 11, 14, 17, 21, 24, 27,
  2, 2, 5, 8, 12, 15, 18, 22, 25, 28,
  3, 3, 6, 8, 13, 16, 19, 23, 26, 29
)
I would like to convert it to a long form, where each variable name consists of two parts, a name (a, b, c) and a number (1, 2, 3), which should become separate variables in the long version of the table, as below.
id name nr data
1 1 a 1 1
2 2 a 1 2
3 3 a 1 3
4 1 a 2 4
5 2 a 2 5
6 3 a 2 6
7 1 a 3 7
8 2 a 3 8
9 3 a 3 8
10 1 b 1 11
11 2 b 1 12
12 3 b 1 13
13 1 b 2 14
14 2 b 2 15
15 3 b 2 16
16 1 b 3 17
17 2 b 3 18
18 3 b 3 19
19 1 c 1 21
20 2 c 1 22
21 3 c 1 23
22 1 c 2 24
23 2 c 2 25
24 3 c 2 26
25 1 c 3 27
26 2 c 3 28
27 3 c 3 29
Can it be done simply with dplyr/tidyr functions? I tried pivot_longer(), but the result was disappointing.
Any pointers are welcome.
I know this question has been asked before, but I can't find a good duplicate target. In the meantime, if you specify the regex to differentiate between the name portion and the nr portion of your column names, you can do it in one function call:
df %>%
  pivot_longer(-id, names_to = c("name", "nr"),
               values_to = "data",
               names_pattern = "(^[a-z])(\\d$)")
#> # A tibble: 27 × 4
#> id name nr data
#> <dbl> <chr> <chr> <dbl>
#> 1 1 a 1 1
#> 2 1 a 2 4
#> 3 1 a 3 7
#> 4 1 b 1 11
#> 5 1 b 2 14
#> 6 1 b 3 17
#> 7 1 c 1 21
#> 8 1 c 2 24
#> 9 1 c 3 27
#> 10 2 a 1 2
#> # … with 17 more rows
Adapt the regex as needed if you have different column names in practice, but this separates them so that the first piece comes from a single lowercase letter at the beginning of the string, and the second piece comes from a single number at the end of the string.
We may do this in a couple of ways. One is to first reshape to 'long' format with pivot_longer(), excluding the 'id' column, and then separate the 'name' column into two by specifying sep as a regex lookbehind; as there is only a single lowercase letter, we split right after that letter ((?<=[a-z])).
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -id, names_to = "name", values_to = "data") %>%
  separate(name, into = c("name", "nr"), sep = "(?<=[a-z])")
-output
A tibble: 27 × 4
id name nr data
<dbl> <chr> <chr> <dbl>
1 1 a 1 1
2 1 a 2 4
3 1 a 3 7
4 1 b 1 11
5 1 b 2 14
6 1 b 3 17
7 1 c 1 21
8 1 c 2 24
9 1 c 3 27
10 2 a 1 2
# … with 17 more rows
Or another option is to append a suffix to the column names and then use pivot_longer():
library(stringr)
df %>%
  rename_with(~ str_c(., "_data"), -id) %>%
  pivot_longer(cols = -id, names_to = c("name", "nr", ".value"),
               names_pattern = "^(.)(.)_(.*)")
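One small difference: the desired output is ordered by name, then nr, then id, while pivot_longer() keeps the rows grouped by id. Adding an arrange() after either reshape reproduces that ordering, e.g.:
df %>%
  pivot_longer(-id, names_to = c("name", "nr"),
               values_to = "data",
               names_pattern = "(^[a-z])(\\d$)") %>%
  arrange(name, nr, id)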
