R Tidy: Dynamic Sequential Threshold

I'm trying to find a tidy way to dynamically adjust a threshold as I "move" through a tibble using library(tidyverse). For example, imagine a tibble containing sequential observations:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,3))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 3
I'm trying to calculate a threshold that starts with the initial value (2) and increments by a prespecified amount (in this case, 1), unless the current observation is greater than that threshold, in which case the current observation becomes the new reference and further thresholds increment from it. Here is what the final tibble would look like:
answer <-
example %>%
mutate(threshold = c(2,3,4,5,6,7,10,11,12,13,14))
answer
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 3 14
I'm looking for the best way to do this using dplyr/tidy. All help is appreciated!
EDIT:
The answers so far are very close, but miss the case where the observed values drop and then increase again. For example, consider the same tibble as example above, but with a 4 instead of a 3 as the final observation:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 4
The diff & cumsum method then gives:
thresh <- 1
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(threshold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 4
7 10 10
8 4 11
9 2 12
10 2 13
11 4 4
Where the final threshold value is incorrect.

You could use diff to create groups and add the within-group row number to the first value of each group.
library(dplyr)
thresh <- 1
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(threshold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
# A tibble: 11 x 2
# observed threshold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 4
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14
To understand how the groups are created, here are the detailed steps:
We first calculate the difference between consecutive values
diff(example$observed)
#[1] -1 0 1 0 2 6 -6 -2 0 1
Note that diff returns output one element shorter than its input.
We compare it with thresh, which gives TRUE every time the jump between consecutive values exceeds the threshold
diff(example$observed) > thresh
#[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
Since the output of diff has one value fewer, we prepend a TRUE so that the first row starts a group
c(TRUE, diff(example$observed) > thresh)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
and finally take cumsum to create the groups used in group_by.
cumsum(c(TRUE, diff(example$observed) > thresh))
# [1] 1 1 1 1 1 2 3 3 3 3 3
EDIT
For the updated question we can add another condition that checks whether the observed value exceeds the running count (first value + row number), and update the groups accordingly.
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh) &
observed > first(observed) + row_number())) %>%
mutate(threshold = first(observed) + row_number() - 1) %>%
ungroup() %>%
select(-gr)
# A tibble: 11 x 2
# observed threshold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 4 14

We can create the grouping variable from the lagged difference of the column, then use pmax() against the running count so the threshold never drops below it:
library(dplyr)
thresh <- 1
example %>%
group_by(grp = cumsum((observed - lag(observed, default = first(observed)) >
thresh))) %>%
mutate(threshold = observed[1] + row_number() - 1) %>%
ungroup %>%
mutate(new = row_number() + 1,
threshold = pmax(threshold, new)) %>%
select(-grp, -new)
# A tibble: 11 x 2
# observed threshold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14

I think I've figured out a way to do this by utilizing zoo::na.locf (although I'm not sure this part is really necessary).
First create the harder of the two examples I've listed in my description:
example2 <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example2 %>%
mutate(def = first(observed) + row_number() - 1) %>%
mutate(t1 = pmax(observed,def)) %>%
mutate(local_maxima = ifelse(observed == t1,t1,NA)) %>%
mutate(groupings = zoo::na.locf(local_maxima)) %>%
group_by(groupings) %>%
mutate(threshold = groupings + row_number() - 1) %>%
ungroup() %>%
select(-def,-t1,-local_maxima,-groupings)
Result:
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 4 14
I'd definitely prefer a more elegant solution if anyone finds one.
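For what it's worth, one tidier possibility (a sketch, not from the original answers) is to express the rule directly with purrr::accumulate(), carrying the running threshold through the column; thresh is the prespecified increment:
library(tidyverse)
thresh <- 1
example2 %>%
mutate(threshold = accumulate(
observed[-1], # feed every observation after the first
.init = first(observed), # the threshold starts at the first value
~ if (.y > .x + thresh) .y else .x + thresh # reset to the observation when it beats the incremented threshold
))
This reproduces the expected output for both example and example2, because the comparison is always against the running threshold rather than the previous observation.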

Related

How can I create a new column with mutate function in R that is a sequence of values of other columns in R?

I have a data frame that looks like this:
a b c
1 2 10
2 2 10
3 2 10
4 2 10
5 2 10
I want to use mutate (or something else under the dplyr framework of functions, or base R) to create a column that is a sequence from b to c (i.e. from 2 to 10) with length equal to the number of rows of this tibble or data frame.
Ideally, my new data frame would look like this:
a b c new
1 2 10 2
2 2 10 4
3 2 10 6
4 2 10 8
5 2 10 10
How can I do this with R using dplyr?
library(tidyverse)
n <- 5
a <- seq(1, n, length.out = n)
b <- rep(2, n)
c <- rep(10, n)
data <- tibble(a, b, c)
We may do
library(dplyr)
data %>%
rowwise %>%
mutate(new = seq(b, c, length.out = n)[a]) %>%
ungroup
-output
# A tibble: 5 × 4
a b c new
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10
If you want this done "by group" for each a value (creating many new rows), we can create the sequence as a list column and then unnest it:
data %>%
mutate(result = map2(b, c, seq, length.out = n)) %>%
unnest(result)
# # A tibble: 25 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 1 2 10 4
# 3 1 2 10 6
# 4 1 2 10 8
# 5 1 2 10 10
# 6 2 2 10 2
# 7 2 2 10 4
# 8 2 2 10 6
# 9 2 2 10 8
# 10 2 2 10 10
# # … with 15 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want to keep the same number of rows and go from the first b value to the last c value, we can use seq directly in mutate:
data %>%
mutate(result = seq(from = first(b), to = last(c), length.out = n()))
# # A tibble: 5 × 4
# a b c result
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 10 2
# 2 2 2 10 4
# 3 3 2 10 6
# 4 4 2 10 8
# 5 5 2 10 10
This one? It works for these particular values because seq(2, 10, length.out = 5) steps by exactly 2, so the new column is just a*b:
library(dplyr)
data %>%
mutate(c1 = a*b)
# A tibble: 5 × 4
a b c c1
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 2
2 2 2 10 4
3 3 2 10 6
4 4 2 10 8
5 5 2 10 10

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to a long format using one variable that indicates the day of the update.
I have the following variables:
baseline temperature variable "temp_b";
time-varying temperature variable "temp_v" and
the number of days "n_days" when the varying variable is updated.
I want to create a long format using the carried forward approach and a max follow-up time of 5 days.
Example of data
df <- structure(list(id=1:3, temp_b=c(20L, 7L, 7L), temp_v=c(30L, 10L, NA), n_days=c(2L, 4L, NA)), class="data.frame", row.names=c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id=c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
days_cont=c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
long_format=c(20,30,30,30,30,7,7,7,10,10,7,7,7,7,7)),
class="data.frame", row.names=c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
tidyr::uncount(5) %>%
group_by(id) %>%
transmute(days_cont = 1:n(),
temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
ungroup()
# # A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
Here's a possibility using tidyverse functions. First, pivot_longer() and drop the values that will not appear in the final df (i.e. rows where temp_v is NA); then group_by() id and mutate the n_days variable to match the number of rows each temperature will occupy in the final df; finally, uncount() the dataframe.
library(tidyverse)
df %>%
replace_na(list(n_days = 6)) %>%
pivot_longer(-c(id, n_days)) %>%
filter(!is.na(value)) %>%
group_by(id) %>%
mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
name == "temp_v" ~ 5 - (n_days - 1))) %>%
uncount(n_days) %>%
mutate(days_cont = row_number()) %>%
select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7
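If you prefer to make the carry-forward explicit, here is a sketch (not from the original answers, and assuming n_days is never 1) that lays out only the known (day, value) pairs and lets tidyr's complete() and fill() do the carrying:
library(dplyr)
library(tidyr)
bind_rows(
df %>% transmute(id, days_cont = 1L, long_format = temp_b), # baseline applies from day 1
df %>% filter(!is.na(n_days)) %>%
transmute(id, days_cont = n_days, long_format = temp_v) # the update kicks in on day n_days
) %>%
group_by(id) %>%
complete(days_cont = 1:5) %>% # expand every id to days 1..5
fill(long_format) %>% # carry the last known value forward
ungroup() %>%
arrange(id, days_cont)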

Group data frame row by consecutive value in R [duplicate]

This question already has answers here:
Create grouping variable for consecutive sequences and split vector
(5 answers)
Closed 1 year ago.
I need to group a data frame by runs of consecutive values in a column.
So for example, given this data frame:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40) )
I want to have a grouping column like:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40), group=c(1,1,1,1,1,2,2,3,4,4,4,5) )
What's a tidyverse (or base R) way to get the column group as explained?
We could do it this way:
df %>%
arrange(time) %>%
group_by(grp = (time %/% 10)+1)
time grp
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
We could use diff on adjacent values of 'time', check whether the difference is not equal to 1, and then turn the logical vector into a numeric index by taking the cumulative sum (cumsum), so that the index increments by 1 at each TRUE value.
library(dplyr)
df %>%
mutate(grp = cumsum(c(TRUE, diff(time) != 1)))
-output
# A tibble: 12 x 2
time grp
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
You can use the following solution:
library(dplyr)
library(purrr)
df %>%
mutate(grp = accumulate(2:nrow(df), .init = 1,
~ if(time[.y] - time[.y - 1] == 1) {
.x
} else {
.x + 1
}))
# A tibble: 12 x 2
time grp
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5

If 1 appears, all subsequent elements of the variable must be 1, grouped by subject

I want to turn:
test <- data.frame(subject=c(rep(1,10),rep(2,10)),x=1:10,y=0:1)
into the following: as I wrote in the title, when the first 1 appears, all subsequent values of "y" for a given "subject" must change to 1, and then the same for the next "subject".
I tried something like this:
test <- test%>%
group_nest(subject) %>%
mutate(XD = map(data,function(x){
ifelse(x$y[which(grepl(1, x$y))[1]:nrow(x)]==TRUE , 1,0)})) %>% unnest(cols = c(data,XD))
It didn't work :(
Try this:
library(dplyr)
#Code
new <- test %>%
group_by(subject) %>%
mutate(y=ifelse(row_number()<min(which(y==1)),y,1))
Output:
# A tibble: 20 x 3
# Groups: subject [2]
subject x y
<dbl> <int> <dbl>
1 1 1 0
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 1
10 1 10 1
11 2 1 0
12 2 2 1
13 2 3 1
14 2 4 1
15 2 5 1
16 2 6 1
17 2 7 1
18 2 8 1
19 2 9 1
20 2 10 1
Since you appear to just have 0's and 1's, a straightforward approach would be to take a cumulative maximum via the cummax function:
library(dplyr)
test %>%
group_by(subject) %>%
mutate(y = cummax(y))
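To see why this works: cummax() replaces each element with the running maximum, so once a 1 appears every later value stays 1. A quick check on a bare vector:
cummax(c(0, 1, 0, 0, 1))
#[1] 0 1 1 1 1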
Duck's answer is considerably more robust if you have a range of values that may appear before or after the first 1.

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see, that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I managed to do so by first calculating the smallest class size...
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered If it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
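As a side note, sample_n() has been superseded by slice_sample() since dplyr 1.0.0, so a modern spelling of the one-step version (with the same caveat about referencing df$) would be:
set.seed(42)
df %>%
group_by(class) %>%
slice_sample(n = min(table(df$class))) %>% # smallest class size, computed on the whole frame
ungroup()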
