R: add a dplyr group label as a number [duplicate]

I cannot get my head around this seemingly simple task: how do I get a group label as a consecutive number?
library(dplyr)
set.seed(1)
df <- data.frame(id = sample(c('a', 'b'), 20, TRUE),
                 name = sample(c('N1', 'N2', 'N3'), 20, TRUE),
                 val = runif(20)) %>%
  group_by(id) %>%
  arrange(id, name)
What I want is a label group_no that numbers the distinct values of the variable name within each id group. I cannot find a solution in the dplyr package itself. Something like this:
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
Note that the values of name could be anything and are not normally suffixed by a number as in the example (otherwise I could simply do sub("^N", "", df$name)).
I am looking for something a little different from the 1:n() solution in SO posts such as here.

I think in this case something as simple as:
df %>%
  mutate(group_no = as.integer(name))
will work, since name is a factor and as.integer() returns its underlying level codes.
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
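A caveat: this relies on the factor's level order matching the desired numbering. If name were a character column, a minimal sketch of the equivalent would be to convert it first (factor levels default to sorted order):
df %>%
  mutate(group_no = as.integer(factor(name)))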

We can track where name changes within each group and take a cumulative sum:
df %>%
  group_by(id) %>%
  mutate(group_no = cumsum(c(TRUE, name[-1] != name[-n()])))
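To see what the cumsum() expression does, here is a quick base R sketch: the comparison flags the positions where the value changes, and the cumulative sum turns those change points into consecutive ids.
x <- c("N1", "N1", "N2", "N2", "N3")
cumsum(c(TRUE, x[-1] != x[-length(x)]))
#> [1] 1 1 2 2 3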
Or with match, which numbers each distinct name by its first appearance within the group:
df %>%
  group_by(id) %>%
  mutate(group_no = match(name, unique(name)))
# A tibble: 20 x 4
# Groups: id [2]
# id name val group_no
# <fct> <fct> <dbl> <int>
# 1 a N1 0.647 1
# 2 a N1 0.530 1
# 3 a N1 0.245 1
# 4 a N2 0.693 2
# 5 a N2 0.478 2
# 6 a N2 0.861 2
# 7 a N3 0.821 3
# 8 a N3 0.0995 3
# 9 a N3 0.662 3
#10 b N1 0.553 1
#11 b N1 0.0233 1
#12 b N1 0.519 1
#13 b N2 0.783 2
#14 b N2 0.789 2
#15 b N2 0.477 2
#16 b N2 0.438 2
#17 b N2 0.407 2
#18 b N3 0.732 3
#19 b N3 0.0707 3
#20 b N3 0.316 3
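Since the data are already arranged so that name sorts in the desired order, dplyr's dense_rank() should give the same numbering here (a sketch; unlike match(), it ranks by value order rather than by first appearance):
df %>%
  group_by(id) %>%
  mutate(group_no = dense_rank(name))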

Here is a solution that uses left_join: compute the group number on the distinct id/name pairs, then join it back.
df %>%
  left_join(df %>%
              distinct(id, name) %>%
              group_by(id) %>%
              mutate(group_no = row_number()),
            by = c("id", "name"))

Related

Flag run-length of grouped intervals

I have a data frame grouped by grp:
df <- data.frame(
  v = rnorm(25),
  grp = c(rep("A", 10), rep("B", 15)),
  size = 2)
I want to flag the run-length of intervals determined by size. For example, for grp == "A", size is 2 and the number of rows is 10, so each interval should have length 10/2 = 5. This code, however, creates intervals of length 2:
df %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% size)
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 1
4 -0.913 A 2 1
5 0.486 A 2 2
6 -1.80 A 2 2
7 -0.370 A 2 3
8 -0.209 A 2 3
9 -0.661 A 2 4
10 -0.177 A 2 4
# … with 15 more rows
How can I flag the correct run-length of the size-determined intervals? The desired output is this:
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 0
4 -0.913 A 2 0
5 0.486 A 2 0
6 -1.80 A 2 1
7 -0.370 A 2 1
8 -0.209 A 2 1
9 -0.661 A 2 1
10 -0.177 A 2 1
# … with 15 more rows
If I interpreted your question correctly, this small change should do the trick: divide by the interval length n()/size instead of by size.
df %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% (n() / size))
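As a quick sanity check of the arithmetic (my own sketch): for grp "A" with 10 rows and size 2, the divisor is n()/size = 5, so the rows fall into two intervals of five.
(seq_len(10) - 1) %/% (10 / 2)
#> [1] 0 0 0 0 0 1 1 1 1 1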
You can use gl(), which generates a factor with a given number of levels, each repeated a fixed number of times:
df %>%
  group_by(grp) %>%
  mutate(interval = gl(first(size), ceiling(n() / first(size)))[1:n()])
Output:
# A tibble: 26 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <fct>
1 -1.12 A 2 1
2 3.04 A 2 1
3 0.235 A 2 1
4 -0.0333 A 2 1
5 -2.73 A 2 1
6 -0.0998 A 2 1
7 0.976 A 2 2
8 0.414 A 2 2
9 0.912 A 2 2
10 1.98 A 2 2
11 1.17 A 2 2
12 -0.509 B 2 1
13 0.704 B 2 1
14 -0.198 B 2 1
15 -0.538 B 2 1
16 -2.86 B 2 1
17 -0.790 B 2 1
18 0.488 B 2 1
19 2.17 B 2 1
20 0.501 B 2 2
21 0.620 B 2 2
22 -0.966 B 2 2
23 0.163 B 2 2
24 -2.08 B 2 2
25 0.485 B 2 2
26 0.697 B 2 2
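For reference, gl(n, k) generates a factor with n levels, each repeated k times, and the [1:n()] subsetting above truncates it to the group size. A quick sketch:
gl(2, 3)
#> [1] 1 1 1 2 2 2
#> Levels: 1 2
gl(2, 3)[1:5]
#> [1] 1 1 1 2 2
#> Levels: 1 2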

How can I calculate the correlation between each pair of variables within each group in R using the dplyr package?

Let's say I have a data frame in R that looks like this:
library(tibble)
var2 = c(rep("A", 3), rep("B", 3), rep("C", 3), rep("D", 3), rep("E", 3), rep("F", 3),
         rep("H", 3), rep("I", 3))
y2 = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
       1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
       -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
       0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
group2 = c(rep(1, 6), rep(2, 6), rep(3, 6), rep(1, 6))
data2 = tibble(var2, group2, y2)
with output:
# A tibble: 24 × 3
var2 group2 y2
<chr> <dbl> <dbl>
1 A 1 -1.23
2 A 1 -0.983
3 A 1 1.28
4 B 1 -0.268
5 B 1 -0.46
6 B 1 -1.23
7 C 2 1.87
8 C 2 0.416
9 C 2 -1.99
10 D 2 0.289
11 D 2 1.7
12 D 2 -0.455
13 E 3 -0.648
14 E 3 0.376
15 E 3 -0.887
16 F 3 0.534
17 F 3 -0.679
18 F 3 -0.923
19 H 1 0.987
20 H 1 0.324
21 H 1 -0.783
22 I 1 -0.679
23 I 1 0.326
24 I 1 0.998
I want to calculate the correlation of each distinct pair of variables within each group using dplyr. Ideally, the resulting tibble would look like this, with the fourth column containing the value of each correlation pair:
group  var1  var2  value
1      A     B     cor(A,B)
1      A     H     cor(A,H)
1      A     I     cor(A,I)
1      B     H     cor(B,H)
1      B     I     cor(B,I)
1      H     I     cor(H,I)
2      C     D     cor(C,D)
3      E     F     cor(E,F)
How can I do that in R? Any help would be appreciated.
A possible solution:
library(tidyverse)
data2 %>%
  group_by(group2) %>%
  group_split() %>%
  map(\(x) x %>%
        group_by(var2) %>%
        group_map(~ data.frame(.x[-1]) %>% set_names(.y)) %>%
        bind_cols() %>%
        cor() %>%
        {data.frame(row = rownames(.)[row(.)[upper.tri(.)]],
                    col = colnames(.)[col(.)[upper.tri(.)]],
                    corr = .[upper.tri(.)])}) %>%
  imap_dfr(~ data.frame(group = .y, .x))
#> group row col corr
#> 1 1 A B -0.9949738
#> 2 1 A H -0.9581357
#> 3 1 B H 0.9819901
#> 4 1 A I 0.8533855
#> 5 1 B I -0.9012948
#> 6 1 H I -0.9669093
#> 7 2 C D 0.4690460
#> 8 3 E F -0.1864518
If you are okay with calling combn() twice (once for the pair names, once for the correlations), you can write a small helper:
fun <- function(x, y){
  a <- split(x, y)                                    # one vector of values per level of y
  col1 <- combn(names(a), 2, paste, collapse = '_')   # pair labels, e.g. "A_B"
  col2 <- combn(unname(a), 2, do.call, what = 'cor')  # cor() of each pair
  data.frame(vars = col1, cor = col2)
}
data2 %>%
  group_by(group2) %>%
  summarise(fun(y2, var2), .groups = 'drop')
# A tibble: 8 x 3
group2 vars cor
<dbl> <chr> <dbl>
1 1 A_B -0.995
2 1 A_H -0.958
3 1 A_I 0.853
4 1 B_H 0.982
5 1 B_I -0.901
6 1 H_I -0.967
7 2 C_D 0.469
8 3 E_F -0.186
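For reference, combn(x, 2, FUN) applies FUN to every unordered pair of elements, which is what builds both the labels and the correlations above. A quick sketch:
combn(c("A", "B", "H"), 2, paste, collapse = "_")
#> [1] "A_B" "A_H" "B_H"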
If you do not want the repeated combn() calls (the computation might be expensive), you can do it in one pass:
data2 %>%
  group_by(group2) %>%
  summarise(s = combn(split(y2, var2), 2,
                      \(x) stack(setNames(cor(x[[1]], x[[2]]), paste(names(x), collapse = '_'))),
                      simplify = FALSE),
            .groups = 'drop') %>%
  unnest(s)
# A tibble: 8 x 3
group2 values ind
<dbl> <dbl> <fct>
1 1 -0.995 A_B
2 1 -0.958 A_H
3 1 0.853 A_I
4 1 0.982 B_H
5 1 -0.901 B_I
6 1 -0.967 H_I
7 2 0.469 C_D
8 3 -0.186 E_F
Another option is widyr::pairwise_cor(), which requires first adding an identifier for the "observation":
library(widyr)
library(dplyr)
data2 %>%
  group_by(var2, group2) %>%
  mutate(obs = row_number()) %>%
  ungroup() %>%
  split(.$group2) %>%
  lapply(function(x) widyr::pairwise_cor(x, var2, obs, y2, upper = FALSE)) %>%
  bind_rows(.id = "group2")
#> # A tibble: 8 × 4
#> group2 item1 item2 correlation
#> <chr> <chr> <chr> <dbl>
#> 1 1 A B -0.995
#> 2 1 A H -0.958
#> 3 1 B H 0.982
#> 4 1 A I 0.853
#> 5 1 B I -0.901
#> 6 1 H I -0.967
#> 7 2 C D 0.469
#> 8 3 E F -0.186
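As a quick sanity check (my own sketch), the first reported value can be reproduced directly from the raw data, since the A and B series in group 1 are the first two triples of y2:
cor(c(-1.23, -0.983, 1.28), c(-0.268, -0.46, -1.23))
#> [1] -0.9949738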

Tidyverse mutate: using a value from one column to reference another column's value, possibly in a different row

I have created a tibble thus:
library(tidyverse)
set.seed(68)
a <- c(1, 2, 3, 4, 5)
b <- runif(5)
c <- c(1, 3, 3, 3, 1)
tib <- tibble(a, b, c)
which produces this
tib
# A tibble: 5 x 3
a b c
<dbl> <dbl> <dbl>
1 1 0.924 1
2 2 0.661 3
3 3 0.402 3
4 4 0.637 3
5 5 0.353 1
I would like to add another column, d, containing the value of b from the row whose a value is given in column c. The resulting data frame should look like this:
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
Thanks for looking!
Use c to index the desired row of b:
tib %>% mutate(d = b[c])
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
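A caveat worth noting (my addition): b[c] works here because a happens to equal the row number. If a were an arbitrary key, matching c against a first would be the safer equivalent:
tib %>% mutate(d = b[match(c, a)])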

Creating a new column using the previous value of a different column and the previous value of itself

How can I create a new column whose starting value is 1 and whose subsequent values are the product of the previous value of another column (b) and the previous value of the new column itself (d)?
These data are made up, but they have the same structure as my data:
> a <- rep(1:10, 3)
> b <- runif(30)
> c <- tibble(a,b)
> c
# A tibble: 30 x 2
a b
<int> <dbl>
1 1 0.945
2 2 0.280
3 3 0.464
4 4 0.245
5 5 0.917
6 6 0.913
7 7 0.144
8 8 0.481
9 9 0.873
10 10 0.754
# ... with 20 more rows
Then I try to calculate column d:
> c <- c %>%
+ group_by(a) %>%
+ mutate(d = accumulate(lag(b, k = 1), `*`, .init = 1))
and it should look like this
# A tibble: 30 x 3
# Groups: a [10]
a b d
<int> <dbl> <dbl>
1 1 0.945 1 <--- b[1] * d[1] = d[2]
2 2 0.280 0.945
3 3 0.464 0.265
4 4 0.245 0.123
5 5 0.917 0.03
#...
But instead I am getting this error message:
Error: Column `d` must be length 3 (the group size) or one, not 4
The problem is that initializing accumulate with .init = prepends an extra element, so the result is one longer than its input and no longer matches the group size.
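A minimal illustration of the length mismatch (my own sketch, with purrr loaded):
library(purrr)
accumulate(c(2, 3, 4), `*`, .init = 1)
#> [1]  1  2  6 24    # length 4 from a length-3 input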
You could try this:
library(dplyr)
library(purrr)
c %>%
  group_by(a) %>%
  mutate(d = accumulate(b[(2:length(b)) - 1], `*`, .init = 1)) %>%
  arrange(a)
# a b d
# <int> <dbl> <dbl>
# 1 1 0.266 1
# 2 1 0.206 0.266
# 3 1 0.935 0.0547
# 4 2 0.372 1
# 5 2 0.177 0.372
# … with 25 more rows
Data
library(tibble)
set.seed(1)
a <- rep(1:10, 3)
b <- runif(30)
c <- tibble(a,b)
Using dplyr, I would do this (note that there is no group_by(a), since your desired output accumulates the running product across all rows):
c %>%
  mutate(d = 1 * accumulate(.x = b[-length(b)],
                            .init = 1,
                            .f = `*`))
# # A tibble: 30 x 3
# a b d
# <int> <dbl> <dbl>
# 1 1 0.562 1
# 2 2 0.668 0.562
# 3 3 0.100 0.375
# 4 4 0.242 0.0376
# 5 5 0.0646 0.00907
# 6 6 0.373 0.000586
# 7 7 0.664 0.000219
# 8 8 0.915 0.000145
# 9 9 0.848 0.000133
# 10 10 0.952 0.000113
# # ... with 20 more rows

Grouping by consecutive value occurrences

I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.
I want to group rows based on consecutive observations of the same value.
For example, if the first four observations of type equal a, those four observations should be assigned to the same group. Order matters, so I can't simply use dplyr::group_by and dplyr::summarize.
The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages, and not data.tables.
library(tidyverse)
# Create some test data
df <- tibble(
  id = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
  val = runif(20)
)
df
#> # A tibble: 20 x 3
#> id type val
#> <int> <chr> <dbl>
#> 1 1 a 0.0606
#> 2 2 a 0.501
#> 3 3 a 0.974
#> 4 4 a 0.0833
#> 5 5 a 0.752
#> 6 6 b 0.0450
#> 7 7 b 0.367
#> 8 8 b 0.649
#> 9 9 b 0.846
#> 10 10 b 0.896
#> 11 11 a 0.178
#> 12 12 a 0.295
#> 13 13 a 0.206
#> 14 14 a 0.233
#> 15 15 a 0.851
#> 16 16 b 0.179
#> 17 17 b 0.801
#> 18 18 b 0.326
#> 19 19 b 0.269
#> 20 20 b 0.584
# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
current <- df$type[i]
lag <- ifelse(i == 1, NA, df$type[i - 1])
lead <- ifelse(i == nrow(df), NA, df$type[i + 1])
if (lead %>% is.na) {
df$consec_group[i] <- ifelse(current == lag, count, count + 1)
} else {
df$consec_group[i] <- count
if (current != lead) count <- count + 1
}
}
df
#> # A tibble: 20 x 4
#> id type val consec_group
#> <int> <chr> <dbl> <dbl>
#> 1 1 a 0.0606 1
#> 2 2 a 0.501 1
#> 3 3 a 0.974 1
#> 4 4 a 0.0833 1
#> 5 5 a 0.752 1
#> 6 6 b 0.0450 2
#> 7 7 b 0.367 2
#> 8 8 b 0.649 2
#> 9 9 b 0.846 2
#> 10 10 b 0.896 2
#> 11 11 a 0.178 3
#> 12 12 a 0.295 3
#> 13 13 a 0.206 3
#> 14 14 a 0.233 3
#> 15 15 a 0.851 3
#> 16 16 b 0.179 4
#> 17 17 b 0.801 4
#> 18 18 b 0.326 4
#> 19 19 b 0.269 4
#> 20 20 b 0.584 4
Created on 2019-03-14 by the reprex package (v0.2.1)
This grouping of consecutive type occurrences is really just an intermediate step. My endgame is to manipulate val for a given consec_group based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.
You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...
library(data.table)
setDT(df)[, groupid := rleid(type)][]
# id type val groupid
# 1: 1 a 0.624078793 1
# 2: 2 a 0.687361541 1
# 3: 3 a 0.817702740 1
# 4: 4 a 0.669857208 1
# 5: 5 a 0.100977936 1
# 6: 6 b 0.418275823 2
# 7: 7 b 0.660119857 2
# 8: 8 b 0.876015209 2
# 9: 9 b 0.473562143 2
# 10: 10 b 0.284474633 2
# 11: 11 a 0.034154862 3
# 12: 12 a 0.391760387 3
# 13: 13 a 0.383107868 3
# 14: 14 a 0.729583433 3
# 15: 15 a 0.006288375 3
# 16: 16 b 0.530179235 4
# 17: 17 b 0.802643704 4
# 18: 18 b 0.409618633 4
# 19: 19 b 0.309363642 4
# 20: 20 b 0.021918512 4
If you insist on using the tidyverse/dplyr, you can (of course) still use the rleid function as follows:
df %>%
  mutate(groupid = data.table::rleid(type))
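If you would rather avoid data.table entirely, a cumsum() over the change points is a common pure dplyr equivalent (my own sketch, not from the original answers):
df %>%
  mutate(consec_group = cumsum(type != lag(type, default = first(type))) + 1)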
Benchmarks on a larger sample:
library(tidyverse)
library(data.table)
# Create some large test data
df <- tibble(
  id = 1:200000,
  type = sample(letters[1:26], 200000, replace = TRUE),
  val = runif(200000)
)
dt <- as.data.table(df)
microbenchmark::microbenchmark(
  dplyr.rleid = df %>% mutate(groupid = data.table::rleid(type)),
  data.table.rleid = dt[, groupid := rleid(type)][],
  rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
  rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
  transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
  times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr.rleid 3.153626 3.278049 3.410363 3.444949 3.502792 3.582626 10
# data.table.rleid 2.965639 3.065959 3.173992 3.145643 3.259672 3.507009 10
# rle 13.059774 14.042797 24.364176 26.126176 29.460561 36.874054 10
# rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719 10
# transform 12.330717 22.419128 22.725242 25.532084 26.187634 26.702794 10
You can emulate rleid() with base R's rle() like this:
df %>%
  mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})
id type val ID_rleid
<int> <chr> <dbl> <int>
1 1 a 0.0430 1
2 2 a 0.858 1
3 3 a 0.504 1
4 4 a 0.318 1
5 5 a 0.469 1
6 6 b 0.144 2
7 7 b 0.173 2
8 8 b 0.0706 2
9 9 b 0.958 2
10 10 b 0.557 2
11 11 a 0.358 3
12 12 a 0.973 3
13 13 a 0.982 3
14 14 a 0.177 3
15 15 a 0.599 3
16 16 b 0.627 4
17 17 b 0.454 4
18 18 b 0.682 4
19 19 b 0.690 4
20 20 b 0.713 4
Or a modification (originally proposed by @d.b) that makes it handier:
df %>%
  mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
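For reference, rle() returns the lengths and values of consecutive runs, so repeating seq_along(lengths) by lengths yields the run id. A quick sketch:
r <- rle(c("a", "a", "b", "b", "b", "a"))
r$lengths
#> [1] 2 3 1
rep(seq_along(r$lengths), r$lengths)
#> [1] 1 1 2 2 2 3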
