Rank subgroup by group (dplyr) - r

This question addresses how to assign the rank of a row within a group. I would like to assign the rank of a subgroup to a row within that subgroup. What I'm really getting at is that I need an abbreviation of the second group_by variable that is guaranteed to be unique, and this is the best way I can think of to go about doing that. Hopefully the desired output below makes this clear enough.
Input dataframe:
my_df <- tibble(
var1 = c(rep("A", 8), rep("B", 12)),
var2 = c(rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_z", 4))
)
Desired output:
# A tibble: 20 x 3
var1 var2 group_rank
<chr> <chr> <dbl>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
How may I assign group_rank as above, ideally (but not necessarily) using a tidyverse approach?

We could use match after grouping
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = match(var2, unique(var2))) %>%
ungroup
-output
# A tibble: 20 x 3
var1 var2 group_rank
<chr> <chr> <int>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
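To make the numbering rule explicit: match(var2, unique(var2)) numbers the subgroups in order of first appearance within each var1 group, which is not necessarily alphabetical order. A minimal sketch on made-up data (the tibble toy and the column names by_appearance and by_sort_order are only for illustration), comparing it with dplyr's dense_rank():

```r
library(dplyr)

# Hypothetical data: within group "A", "y" appears before "x"
toy <- tibble(
  var1 = c("A", "A", "A", "B", "B"),
  var2 = c("y", "x", "y", "x", "z")
)

res <- toy %>%
  group_by(var1) %>%
  mutate(
    by_appearance = match(var2, unique(var2)),  # order of first appearance
    by_sort_order = dense_rank(var2)            # sorted (alphabetical) order
  ) %>%
  ungroup()

res
```

Either column is unique per subgroup within var1, so both would work as an abbreviation; they only differ in which subgroup gets which number.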

Following the approach of the respected @akrun, but using data.table::rleid instead of match:
library(tidyverse)
my_df <- tibble(
var1 = c(rep("A", 8), rep("B", 12)),
var2 = c(rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_z", 4))
)
my_df %>%
group_by(var1) %>%
mutate(res = data.table::rleid(var2))
#> # A tibble: 20 x 3
#> # Groups: var1 [2]
#> var1 var2 res
#> <chr> <chr> <int>
#> 1 A long_string_x 1
#> 2 A long_string_x 1
#> 3 A long_string_x 1
#> 4 A long_string_x 1
#> 5 A long_string_y 2
#> 6 A long_string_y 2
#> 7 A long_string_y 2
#> 8 A long_string_y 2
#> 9 B long_string_x 1
#> 10 B long_string_x 1
#> 11 B long_string_x 1
#> 12 B long_string_x 1
#> 13 B long_string_y 2
#> 14 B long_string_y 2
#> 15 B long_string_y 2
#> 16 B long_string_y 2
#> 17 B long_string_z 3
#> 18 B long_string_z 3
#> 19 B long_string_z 3
#> 20 B long_string_z 3
Created on 2021-07-12 by the reprex package (v2.0.0)
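One caveat, as far as I can tell: rleid() numbers consecutive runs, so it only agrees with the match() approach when each var2 value forms a single contiguous block within its var1 group. A small base R sketch of the difference, using rle() as a stand-in for data.table::rleid (the vector x is made up):

```r
# Hypothetical vector where "x" reappears after "y"
x <- c("x", "x", "y", "x")

# Run-length ids: a new id every time the value changes
r <- rle(x)
run_id <- rep(seq_along(r$lengths), r$lengths)

# First-appearance ids: the same value always gets the same id
first_seen_id <- match(x, unique(x))

run_id         # 1 1 2 3
first_seen_id  # 1 1 2 1
```

If the data are not sorted by var2 within var1, sorting first or using match() is probably the safer choice.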

Update:
As Greg pointed out (see comments), group_by() defaults to .add = FALSE, so a second group_by() call replaces the existing grouping instead of adding to it. If the intention is to group by both variables, .add = TRUE has to be set, like:
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = cur_group_id()) %>%
group_by(var2, .add=TRUE) %>%
mutate(group_rank = cur_group_id())
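Note that with both var1 and var2 as grouping variables, cur_group_id() numbers the var1/var2 combinations across the whole data frame (1 to 5 for this data) instead of restarting at 1 within each var1. For this data, as Greg pointed out, grouping by var2 alone is enough to get the desired output: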
my_df %>% group_by(var2) %>% mutate(group_rank = cur_group_id())
First answer:
We could use cur_group_id() twice:
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = cur_group_id()) %>%
group_by(var2) %>%
mutate(group_rank = cur_group_id())
Output:
var1 var2 group_rank
<chr> <chr> <int>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
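A word of caution on this answer: cur_group_id() after group_by(var2) numbers the var2 values over the whole data frame, so it matches the desired output here only because every var1 group starts with long_string_x. A sketch on made-up data where group "B" has no "x" (toy, global_id and within_id are illustrative names):

```r
library(dplyr)

# Hypothetical data: group "B" contains only "y" and "z"
toy <- tibble(
  var1 = c("A", "A", "B", "B"),
  var2 = c("x", "y", "y", "z")
)

# Id of var2 across the whole data frame
global_id <- toy %>%
  group_by(var2) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()

# Id of var2 restarting within each var1
within_id <- toy %>%
  group_by(var1) %>%
  mutate(id = match(var2, unique(var2))) %>%
  ungroup()

global_id$id  # 1 2 2 3
within_id$id  # 1 2 1 2
```

Both ids are unique per subgroup, so either may serve as an abbreviation; only the second one is a rank within var1.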

Related

Nested list to grouped rows in R

I have the following nested list called l (dput below):
> l
$A
$A$`1`
[1] 1 2 3
$A$`2`
[1] 3 2 1
$B
$B$`1`
[1] 2 2 2
$B$`2`
[1] 3 4 3
I would like to convert this to a grouped dataframe where A and B are the first group column and 1 and 2 are the subgroups with respective values. The desired output should look like this:
group subgroup values
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
As you can see A and B are the main group and 1 and 2 are the subgroups. Using purrr::flatten(l) or unnest doesn't work. So I was wondering if anyone knows how to convert a nested list to a grouped row dataframe?
dput of l:
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)), B = list(`1` = c(2,
2, 2), `2` = c(3, 4, 3)))
Using stack and rowbind with id:
data.table::rbindlist(lapply(l, stack), idcol = "id")
# id values ind
# 1: A 1 1
# 2: A 2 1
# 3: A 3 1
# 4: A 3 2
# 5: A 2 2
# 6: A 1 2
# 7: B 2 1
# 8: B 2 1
# 9: B 2 1
# 10: B 3 2
# 11: B 4 2
# 12: B 3 2
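If data.table is not available, the same idea can be sketched in base R: stack() turns each inner list into a two-column data frame (values, ind), and the outer name is added by hand before row-binding. The names parts and out are only for illustration:

```r
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)),
          B = list(`1` = c(2, 2, 2), `2` = c(3, 4, 3)))

# One data frame per top-level group, with the group name attached
parts <- lapply(names(l), function(g) {
  s <- stack(l[[g]])  # columns: values, ind (the inner names as a factor)
  data.frame(group = g,
             subgroup = as.character(s$ind),
             values = s$values,
             stringsAsFactors = FALSE)
})

# Bind the per-group pieces into one data frame
out <- do.call(rbind, parts)
out
```

This also renames the columns to group, subgroup and values, in the order asked for in the question.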
You can use enframe() to convert the list into a data.frame, and unnest the value column twice.
library(tidyr)
tibble::enframe(l, name = "group") %>%
unnest_longer(value, indices_to = "subgroup") %>%
unnest(value)
# A tibble: 12 × 3
group value subgroup
<chr> <dbl> <chr>
1 A 1 1
2 A 2 1
3 A 3 1
4 A 3 2
5 A 2 2
6 A 1 2
7 B 2 1
8 B 2 1
9 B 2 1
10 B 3 2
11 B 4 2
12 B 3 2
Turn the list directly into a data frame, then pivot it into a long format and arrange to your desired order.
library(tidyverse)
l %>%
as.data.frame() %>%
pivot_longer(everything(), names_to = c("group", "subgroup"),
values_to = "values",
names_pattern = "(.+?)\\.(.+?)") %>%
arrange(group, subgroup)
# A tibble: 12 × 3
group subgroup values
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
You can combine rrapply with unnest, which has the benefit of working on lists of arbitrary length:
library(rrapply)
library(tidyr)
rrapply(l, how = "melt") |>
unnest(value)
# A tibble: 12 × 3
L1 L2 value
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3

How to iteratively add rows of one tibble to tibbles within a list of tibbles?

Given a list of tibbles
library(dplyr)
library(purrr)
ltb <- list(tibble(a=1:10, b=1:10), tibble(a=1:10, b=1:10), tibble(a=1:10, b=1:10))
map(ltb, ~head(., 2))
[[1]]
# A tibble: 2 × 2
a b
<int> <int>
1 1 1
2 2 2
[[2]]
# A tibble: 2 × 2
a b
<int> <int>
1 1 1
2 2 2
[[3]]
# A tibble: 2 × 2
a b
<int> <int>
1 1 1
2 2 2
and another single tibble whose number of rows matches the number of elements in the above list
tib <- tibble(data1 = letters[1:3], data2 = LETTERS[1:3], data3 = letters[1:3])
> tib
# A tibble: 3 × 3
data1 data2 data3
<chr> <chr> <chr>
1 a A a
2 b B b
3 c C c
how can I bind the first row of tib to the first tibble in ltb, the second row of tib to the second tibble in ltb? Obviously, this should recycle the rows in tib to (dynamically) match the number of rows in each tibble in ltb.
So the result should look something like this
map(newltb, ~head(., 3))
[[1]]
# A tibble: 3 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 a A a
2 2 2 a A a
3 3 3 a A a
[[2]]
# A tibble: 3 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 b B b
2 2 2 b B b
3 3 3 b B b
[[3]]
# A tibble: 3 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 c C c
2 2 2 c C c
3 3 3 c C c
I am struggling to decide whether to use map2 or pmap2; neither one has worked for me.
You could split tib by rows and use map2 and bind_cols like so:
library(dplyr, warn.conflicts = FALSE)
library(purrr)
ltb <- list(tibble(a=1:10, b=1:10), tibble(a=1:10, b=1:10), tibble(a=1:10, b=1:10))
tib <- tibble(data1 = letters[1:3], data2 = LETTERS[1:3], data3 = letters[1:3])
tib_split <- tib %>%
split(seq(nrow(.)))
map2(ltb, tib_split, bind_cols)
#> [[1]]
#> # A tibble: 10 × 5
#> a b data1 data2 data3
#> <int> <int> <chr> <chr> <chr>
#> 1 1 1 a A a
#> 2 2 2 a A a
#> 3 3 3 a A a
#> 4 4 4 a A a
#> 5 5 5 a A a
#> 6 6 6 a A a
#> 7 7 7 a A a
#> 8 8 8 a A a
#> 9 9 9 a A a
#> 10 10 10 a A a
#>
#> [[2]]
#> # A tibble: 10 × 5
#> a b data1 data2 data3
#> <int> <int> <chr> <chr> <chr>
#> 1 1 1 b B b
#> 2 2 2 b B b
#> 3 3 3 b B b
#> 4 4 4 b B b
#> 5 5 5 b B b
#> 6 6 6 b B b
#> 7 7 7 b B b
#> 8 8 8 b B b
#> 9 9 9 b B b
#> 10 10 10 b B b
#>
#> [[3]]
#> # A tibble: 10 × 5
#> a b data1 data2 data3
#> <int> <int> <chr> <chr> <chr>
#> 1 1 1 c C c
#> 2 2 2 c C c
#> 3 3 3 c C c
#> 4 4 4 c C c
#> 5 5 5 c C c
#> 6 6 6 c C c
#> 7 7 7 c C c
#> 8 8 8 c C c
#> 9 9 9 c C c
#> 10 10 10 c C c
In base R, we can use a for loop
for(i in seq_along(ltb)) ltb[[i]][names(tib)] <- tib[i,]
-output
> ltb
[[1]]
# A tibble: 10 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 a A a
2 2 2 a A a
3 3 3 a A a
4 4 4 a A a
5 5 5 a A a
6 6 6 a A a
7 7 7 a A a
8 8 8 a A a
9 9 9 a A a
10 10 10 a A a
[[2]]
# A tibble: 10 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 b B b
2 2 2 b B b
3 3 3 b B b
4 4 4 b B b
5 5 5 b B b
6 6 6 b B b
7 7 7 b B b
8 8 8 b B b
9 9 9 b B b
10 10 10 b B b
[[3]]
# A tibble: 10 × 5
a b data1 data2 data3
<int> <int> <chr> <chr> <chr>
1 1 1 c C c
2 2 2 c C c
3 3 3 c C c
4 4 4 c C c
5 5 5 c C c
6 6 6 c C c
7 7 7 c C c
8 8 8 c C c
9 9 9 c C c
10 10 10 c C c
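Both answers rely on a one-row tibble being recycled to the length of the target tibble. A variant that skips the explicit split, sketched under the assumption that ltb and tib are paired by position (newltb is the name used in the question):

```r
library(dplyr)
library(purrr)

ltb <- list(tibble(a = 1:10, b = 1:10),
            tibble(a = 1:10, b = 1:10),
            tibble(a = 1:10, b = 1:10))
tib <- tibble(data1 = letters[1:3], data2 = LETTERS[1:3], data3 = letters[1:3])

# Walk over the list and the row indices of tib in parallel;
# bind_cols() recycles the single row tib[.y, ] to the rows of .x
newltb <- map2(ltb, seq_len(nrow(tib)), ~ bind_cols(.x, tib[.y, ]))

map(newltb, ~ head(., 3))
```

map2 is the natural fit here because exactly two things are iterated together; pmap would only be needed for three or more parallel inputs.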

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I managed to do so by first calculating the smallest class size..
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
..and then apply the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wonder if it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because the property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that that is any more elegant/readable.
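For completeness, in recent dplyr versions sample_n() is superseded by slice_sample(). A two-step sketch of the same balancing with it, assuming dplyr 1.0.0 or later (n_min and balanced are illustrative names):

```r
library(dplyr)

df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)

# Size of the smallest class, computed on the ungrouped data
n_min <- min(table(df$class))

set.seed(1)
balanced <- df %>%
  group_by(class) %>%
  slice_sample(n = n_min) %>%  # draw n_min rows from every class
  ungroup()

count(balanced, class)
```

Computing n_min outside the pipe avoids referencing df$ inside it, at the cost of a second statement.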

Split information from two columns, R, tidyverse

I've got some data in two columns:
# A tibble: 16 x 2
code niveau
<chr> <dbl>
1 A 1
2 1 2
3 2 2
4 3 2
5 4 2
6 5 2
7 B 1
8 6 2
9 7 2
My desired output is:
# A tibble: 16 x 3
code niveau cat
<chr> <dbl> <chr>
1 A 1 A
2 1 2 A
3 2 2 A
4 3 2 A
5 4 2 A
6 5 2 A
7 B 1 B
8 6 2 B
Is there a tidy way to convert these data without looping through it?
Here some dummy data:
data<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2))
desired_output<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2),
cat=c(rep('A', 6),rep('B', 5), rep('C', 5)))
Nicolas
Probably, you can create a new column cat and replace code values with NA where there is a number. We can then use fill to replace missing values with previous non-NA value.
library(dplyr)
data %>% mutate(cat = replace(code, grepl('\\d', code), NA)) %>% tidyr::fill(cat)
# A tibble: 16 x 3
# code niveau cat
# <chr> <dbl> <chr>
# 1 A 1 A
# 2 1 2 A
# 3 2 2 A
# 4 3 2 A
# 5 4 2 A
# 6 5 2 A
# 7 B 1 B
# 8 6 2 B
# 9 7 2 B
#10 8 2 B
#11 9 2 B
#12 C 1 C
#13 10 2 C
#14 11 2 C
#15 12 2 C
#16 13 2 C
We can use str_detect from stringr
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(cat = replace(code, str_detect(code, '\\d'), NA)) %>%
fill(cat)
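Since niveau already marks the header rows, the same result can probably be had without any regex. A base R sketch, assuming that every category starts with a niveau == 1 row and that the first row is such a header (is_header is an illustrative name):

```r
data <- data.frame(
  code = c('A', 1, 2, 3, 4, 5, 'B', 6, 7, 8, 9, 'C', 10, 11, 12, 13),
  niveau = c(1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# TRUE on the rows that open a new category
is_header <- data$niveau == 1

# cumsum() counts how many headers have been seen so far,
# which indexes into the vector of header codes
data$cat <- data$code[is_header][cumsum(is_header)]
data
```

This relies on niveau rather than on the code being non-numeric, so it would also work if a category label happened to contain a digit.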

restructuring multiple columns in R

Here is a sample of my data:
dat<-read.table(text=" id bx1 Z1A Z1B Z1C QR1 bx2 Z2A Z2B Z2C QR2
1 1 1 2 3 C 18 2 2 1 E
2 11 2 3 3 B 14 3 3 3 A
",header=TRUE)
I want to get the following table:
id bx Z QR Score
1 1 Z1A C 1
1 1 Z1B C 2
1 1 Z1C C 3
1 18 Z2A E 2
1 18 Z2B E 2
1 18 Z2C E 1
2 11 Z1A B 2
2 11 Z1B B 3
2 11 Z1C B 3
2 14 Z2A A 3
2 14 Z2B A 3
2 14 Z2C A 3
Assume that I have more bx and Z columns. I have tried the following, but it does not work. I would like to do it with tidyverse or other packages; I was unable to find a solution.
df1<-melt(dat, id.var= "id")
Thanks for your help
In this case, we can use a left_join after separately doing the pivot_longer
library(dplyr)
library(tidyr)
library(stringr)
dat %>%
select(id, starts_with('Z')) %>%
pivot_longer(cols = starts_with('Z'), values_to = 'Score',
names_to = 'Z') %>%
group_by(id) %>%
mutate(group = as.character(as.integer(factor(str_remove(Z, "[A-Z]$"))))) %>%
left_join(dat %>%
select(id, matches('^[^Z]')) %>%
pivot_longer(cols = -id, names_to = c(".value", "group"),
names_pattern = "^([A-Za-z]+)([0-9]+)")) %>%
select(-group)
# A tibble: 12 x 5
# Groups: id [2]
# id Z Score bx QR
# <int> <chr> <int> <int> <fct>
# 1 1 Z1A 1 1 C
# 2 1 Z1B 2 1 C
# 3 1 Z1C 3 1 C
# 4 1 Z2A 2 18 E
# 5 1 Z2B 2 18 E
# 6 1 Z2C 1 18 E
# 7 2 Z1A 2 11 B
# 8 2 Z1B 3 11 B
# 9 2 Z1C 3 11 B
#10 2 Z2A 3 14 A
#11 2 Z2B 3 14 A
#12 2 Z2C 3 14 A
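To see what the right-hand side of the join contributes, it may help to run it on its own. In names_to, the special ".value" entry tells pivot_longer to use the first regex capture (bx, QR) as output column names and the second capture as the group key. A sketch restricted to the bx and QR columns (the name lookup is made up):

```r
library(dplyr)
library(tidyr)

dat <- read.table(text = " id bx1 Z1A Z1B Z1C QR1 bx2 Z2A Z2B Z2C QR2
1 1 1 2 3 C 18 2 2 1 E
2 11 2 3 3 B 14 3 3 3 A
", header = TRUE, stringsAsFactors = FALSE)

# One row per id and group, with bx and QR side by side
lookup <- dat %>%
  select(id, starts_with("bx"), starts_with("QR")) %>%
  pivot_longer(cols = -id,
               names_to = c(".value", "group"),
               names_pattern = "^([A-Za-z]+)([0-9]+)")

lookup
```

The result has four rows (two groups for each id), and joining it to the long Z table by id and group attaches the matching bx and QR to every score.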
Or another option is to do a single pivot_longer and then fill the selected columns
dat %>%
pivot_longer(cols = -id, names_to = c(".value", "group"),
names_pattern = "^([A-Za-z]+)([0-9]+[A-Z]?)") %>%
group_by(id) %>%
fill(bx, QR) %>%
ungroup %>%
filter(!is.na(Z)) %>%
rename_at(vars(Z, group), ~ c('Score', 'Z')) %>%
mutate(Z = str_c('Z', Z))
# A tibble: 12 x 5
# id Z bx Score QR
# <int> <chr> <int> <int> <fct>
# 1 1 Z1A 1 1 C
# 2 1 Z1B 1 2 C
# 3 1 Z1C 1 3 C
# 4 1 Z2A 18 2 E
# 5 1 Z2B 18 2 E
# 6 1 Z2C 18 1 E
# 7 2 Z1A 11 2 B
# 8 2 Z1B 11 3 B
# 9 2 Z1C 11 3 B
#10 2 Z2A 14 3 A
#11 2 Z2B 14 3 A
#12 2 Z2C 14 3 A

Resources