The Most Efficient Way of Forming Groups using R - r

I have a tibble dt given as follows:
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt
As one can observe the rule for grouping is:
starts 0 and ends with 1 (e.g., groups A, B, D) or
it solely contains 1 (e.g., group C)
Problem: Given a tibble with column integer vector x of zeros and 1 that starts with 0 and ends in 1, what is the most efficient way to obtain a grouping using R? (You can use any grouping symbols/factors.)

We can get the cumulative sum of 'x' (assuming it is binary), take the lag add 1 and use that index to replace it with LETTERS (Note that LETTERS was used only as part of matching with the expected output - it can take go up to certain limit)
library(dplyr)
dt %>%
mutate(grp2 = LETTERS[lag(cumsum(x), default = 0)+ 1])
-output
# A tibble: 10 x 3
x grp grp2
<int> <fct> <chr>
1 0 A A
2 0 A A
3 1 A A
4 0 B B
5 0 B B
6 0 B B
7 1 B B
8 1 C C
9 0 D D
10 1 D D

Though the strategy proposed by Akrun is fantastic, yet to show that it can be managed through accumulate also
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt %>%
mutate(GRP = accumulate(lag(x, default = 0),.init =1, ~ if(.y != 1) .x else .x+1)[-1])
#> # A tibble: 10 x 3
#> x grp GRP
#> <int> <fct> <dbl>
#> 1 0 A 1
#> 2 0 A 1
#> 3 1 A 1
#> 4 0 B 2
#> 5 0 B 2
#> 6 0 B 2
#> 7 1 B 2
#> 8 1 C 3
#> 9 0 D 4
#> 10 1 D 4
Created on 2021-06-13 by the reprex package (v2.0.0)

Related

How can I find the column index of the first non-zero value in a row with R dplyr?

I'm working in R. I have a dataset of COVID case totals that looks like this:
Facility
Day_1
Day_2
Day_3
A
0
0
1
B
1
2
5
C
0
2
6
D
0
0
0
I would like to use mutate() to create a new column, first_case, that has the column index of the first non-zero element in each row -- or "NA" if there is no non-zero element. I thought about using where(), but couldn't quite figure out how to get a column index instead of a row index.
Any help is much appreciated!
We can use max.col to get the first instance when the value is non-zero in each zero.
library(dplyr)
df %>%
mutate(first_case = {
tmp <- select(., starts_with('Day'))
ifelse(rowSums(tmp) == 0, NA, max.col(tmp != 0, ties.method = 'first'))
})
# Facility Day_1 Day_2 Day_3 first_case
#1 A 0 0 1 3
#2 B 1 2 5 1
#3 C 0 2 6 2
#4 D 0 0 0 NA
first_case has column number of the 'Day' columns, if you need column number in the data you can add + 1 to above output.
This is probably unnecessarily complex, because the data is not in a long ('tidy') format that dplyr etc expect.
datlong <- dat %>%
pivot_longer(cols=starts_with("Day"), names_to = c("day"), names_pattern="_(\\d+)")
## A tibble: 12 x 3
# Facility day value
# <chr> <chr> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 B 1 1
# 5 B 2 2
# 6 B 3 5
# 7 C 1 0
# 8 C 2 2
# 9 C 3 6
#10 D 1 0
#11 D 2 0
#12 D 3 0
It's then simple to get the first/second/third/[n]th day above whatever value, as well as to calculate minimums, maximums, means, weekly averages, rolling averages, whatever, because you are now dealing with a plain old vector of values rather than a list of values across multiple columns.
datlong %>%
group_by(Facility) %>%
filter(value > 0, .preserve=TRUE) %>%
summarise(first_day = first(day))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 4 x 2
# Facility first_day
# <chr> <chr>
#1 A 3
#2 B 1
#3 C 2
#4 D <NA>
Alternative using indexes and stuff, which is less dplyr-like:
datlong %>%
group_by(Facility) %>%
summarise(first_day = day[value > 0][1])

Add column to grouped data that assigns 1 to individuals and randomly assigns 1 or 0 to pairs

I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
group_by(family) %>%
mutate(
random = if_else(
condition = n() == 1,
true = 1,
false = as.double(sample(0:1,1,replace = T))
)
You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1
Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1

filter all rows smaller than x with all following values also smaller than x

I am looking for a concise way to filter a data.frame for all rows smaller than a value x with all following values also smaller than x. I found a way but it is somehwat verbose. I tried to do it with dplyr::cumall and cumany, but was not able to figure it out.
Here is a small reprex including my actual approach. Ideally I would only have one filter line or mutate + filter, but with the current approach it takes two rounds of mutate/filter.
library(dplyr)
# Original data
tbl <- tibble(value = c(100,100,100,10,10,5,10,10,5,5,5,1,1,1,1))
# desired output:
# keep only rows, where value is smaller than 5 and ...
# no value after that is larger than 5
tbl %>%
mutate(id = row_number()) %>%
filter(value <= 5) %>%
mutate(id2 = lead(id, default = max(id) + 1) - id) %>%
filter(id2 == 1)
#> # A tibble: 7 x 3
#> value id id2
#> <dbl> <int> <dbl>
#> 1 5 9 1
#> 2 5 10 1
#> 3 5 11 1
#> 4 1 12 1
#> 5 1 13 1
#> 6 1 14 1
#> 7 1 15 1
Created on 2020-04-20 by the reprex package (v0.3.0)
You could combine cummin with a reversed reverse cummax:
tbl %>% filter(rev(cummax(rev(value))) <= 5 & cummin(value) <= 5)
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1
A base R option is to use subset + rle
tblout <- subset(tbl,
with(rle(value<=5 & c(0,diff(value))<=0),
rep(lengths>1 & values,lengths)))
such that
> tblout
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1

dplyr mutate: create column using first occurrence of another column

I was wondering if there's a more elegant way of taking a dataframe, grouping by x to see how many x's occur in the dataset, then mutating to find the first occurrence of every x (y)
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first after grouping by 'x' to create a new column, use that also in group_by and get the count with n()
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), add = TRUE) %>%
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question. Why not keep it simple? for example
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2

Filter (subset) by conditions in 2 columns in R (dplyr or otherwise)

Given a dataset such as:
set.seed(134)
df<- data.frame(ID= rep(LETTERS[1:5], each=2),
condition=rep(0:1, 5),
value=rpois(10, 3)
)
df
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
4 B 1 2
5 C 0 3
6 C 1 1
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
For each ID, when the value for condition==0 is less than the value for condition==1, I want to keep both observations. When the value for condition==0 is greater than condition==1, I want to keep only the row for condition==0.
The subset returned should be this:
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
5 C 0 3
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
Using dplyr the first step is:
df %>% group_by(ID) %>%
But not sure where to go from there.
Translating fairly literally,
library(dplyr)
set.seed(134)
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
condition = rep(0:1, 5),
value = rpois(10, 3))
df %>% group_by(ID) %>%
filter(condition == 0 |
(condition == 1 & value > value[condition == 0]))
#> # A tibble: 8 x 3
#> # Groups: ID [5]
#> ID condition value
#> <fct> <int> <int>
#> 1 A 0 2
#> 2 A 1 3
#> 3 B 0 5
#> 4 C 0 3
#> 5 D 0 2
#> 6 D 1 4
#> 7 E 0 1
#> 8 E 1 5
This depends on each group having a single observation with condition == 0, but should otherwise be fairly robust.
This is may not be the easiest way, but should work as you want.
library(reshape2)
df %>%
dcast(ID ~ condition, value.var = 'value') %>% # cast to wide format
mutate(`1` = ifelse(`1` > `0`, `1`, NA)) %>% # turn 0>1 values as NA
melt('ID') %>% # melt as long format
arrange(ID) %>% # sort by ID
filter(complete.cases(.)) # remove NA rows
Output:
ID variable value
1 A 0 2
2 A 1 3
3 B 0 5
4 C 0 3
5 D 0 2
6 D 1 4
7 E 0 1
8 E 1 5
You always want the value from the first row in each group. You only want the value from the second row in each group if it's larger than the first.
This works:
df %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))
Edit: as #alistaire points out, this method depends on a particular order in, which is might be a good idea to guarantee as follows:
df %>%
arrange(ID, condition) %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))

Resources