adding grouping indicator for repeating sequences - r

I thought this would be simple, but I failed and can't find an answer anywhere.
My example data looks like this. I have nro running from 1 to x, restarting at random points. I would like to create an ind variable that is 1 for the first run, 2 for the second, and so on.
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
End result should look like this:
tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)),
ind = c(rep(1, 3), rep(2, 5), rep(3, 4)))
# A tibble: 12 x 2
nro ind
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 4 2
8 5 2
9 1 3
10 2 3
11 3 3
12 4 3
I thought I could do something with ifelse but failed miserably.
tbl %>%
mutate(ind = ifelse(nro < lag(nro), 1 + lag(ind), 1))
I assume this needs some kind of loop.

for sequences of the same length
If every run has the same length (the output below assumes nro = rep(1:4, 3)), you can group_by your nro variable and then take the row_number():
tbl %>%
group_by(nro) %>%
mutate(ind = row_number())
# A tibble: 12 x 2
# Groups: nro [4]
# nro ind
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
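A base R sketch of the same idea, for readers who do not have dplyr loaded. It assumes the equal-length data shown in the output above (nro = rep(1:4, 3)); the second call uses the question's uneven runs to show why grouping by nro alone is not enough there. The variable names are illustrative, not from the original answer.

```r
# Equal-length runs: number the occurrences of each nro value.
nro_equal <- rep(1:4, 3)
ind_equal <- ave(seq_along(nro_equal), nro_equal, FUN = seq_along)
print(ind_equal)

# Uneven runs from the question: the same trick mislabels some rows,
# because the first 4 and the first 5 only appear in the second run.
nro_uneven <- c(1:3, 1:5, 1:4)
ind_uneven <- ave(seq_along(nro_uneven), nro_uneven, FUN = seq_along)
print(ind_uneven)
```

With uneven runs the indicator is no longer constant within a run, which is why a different approach is needed for that case.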
for varying length of the sequences
inspired by docendo discimus's comment
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
tbl %>%
mutate(ind = cumsum(nro == 1))
However, this only works for sequences that begin with 1, because only the TRUE values of nro == 1 are accumulated.
A more general approach starts a new group whenever nro drops below its previous value:
tbl %>% mutate(dif = nro - lag(nro)) %>%
mutate(dif = ifelse(is.na(dif), nro, dif)) %>%
mutate(ind = cumsum(dif < 0) + 1) %>%
select(-dif)
# A tibble: 12 x 2
# nro ind
# <int> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 4 2
# 8 5 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
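The "new group whenever the value drops" logic can also be sketched in base R with diff(). The helper name run_id is an assumption made for illustration. It assumes a non-empty vector in which each run is increasing and a restart always shows up as a strict decrease.

```r
# Hypothetical helper: label runs, starting a new run at every decrease.
# Assumes x has at least one element.
run_id <- function(x) cumsum(c(TRUE, diff(x) < 0))

nro <- c(1:3, 1:5, 1:4)
ind <- run_id(nro)
print(ind)

# Runs that do not start at 1 are handled too, unlike cumsum(x == 1).
shifted <- c(2:4, 2:5)
ind_shifted <- run_id(shifted)
naive_shifted <- cumsum(shifted == 1)
print(ind_shifted)
print(naive_shifted)
```

The first element is always treated as the start of run 1, so no special handling of the leading NA from lag() is needed.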

Related

Roll max in R. From first row to current row

I would like to calculate, within each id, the max value from the first row up to (but not including) the current row, as in the result column below:
df <- data.frame(id = c(1,1,1,1,2,2,2), value = c(2,5,3,2,4,5,4), result = c(NA,2,5,5,NA,4,5))
I have tried grouping by id with dplyr and using the rollmax function from zoo, but did not succeed.
1) rollmax uses a fixed width, but here the width varies, so we use rollapplyr, which stays close to the approach in the question:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(out = lag(rollapplyr(value, 1:n(), max))) %>%
ungroup
giving:
# A tibble: 7 x 4
id value result out
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
2) It is also possible to do the grouping through the width (second) argument of rollapplyr, which eliminates dplyr. In this case the widths are 1, 2, 3, 4, 1, 2, 3, and Max is like max except that it ignores the last element of its argument x. (An alternative expression for the width would be seq_along(id) - match(id, id) + 1.)
library(zoo)
Max <- function(x) if (length(x) == 1) NA else max(head(x, -1))
transform(df, out = rollapplyr(value, sequence(rle(id)$lengths), Max))
giving:
id value result out
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
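The two width expressions mentioned in approach 2 can be compared directly in base R. This is only a sketch on the question's id column; both expressions assume that the rows of each id are contiguous.

```r
id <- c(1, 1, 1, 1, 2, 2, 2)

# Width via run lengths: restart the counter at each new run of id.
w_rle <- sequence(rle(id)$lengths)

# Width via position minus the first position of the same id, plus one.
w_match <- seq_along(id) - match(id, id) + 1

print(w_rle)
print(w_match)
```

Each width is the number of rows seen so far within the current id, so the window passed to Max always starts at the first row of the group.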
A data.table option using shift + cummax:
library(data.table)
setDT(df)[, result2 := shift(cummax(value)), by = id][]
id value result result2
1: 1 2 NA NA
2: 1 5 2 2
3: 1 3 5 5
4: 1 2 5 5
5: 2 4 NA NA
6: 2 5 4 4
7: 2 4 5 5
A dplyr option using lag + cummax:
library(dplyr)
df |>
group_by(id) |>
mutate(result = lag(cummax(value)))
# # A tibble: 7 x 3
# # Groups: id [2]
# id value result
# <dbl> <dbl> <dbl>
# 1 1 2 NA
# 2 1 5 2
# 3 1 3 5
# 4 1 2 5
# 5 2 4 NA
# 6 2 5 4
# 7 2 4 5
Here is a base R solution. This gives just the cumulative maximum within each id:
df$result = ave(df$value, df$id, FUN=cummax)
To get the lagged cumulative maximum you wanted:
df$result = ave(df$value, df$id, FUN=function(x) c(NA,cummax(x[-(length(x))])))
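The lagged cumulative maximum can be wrapped in a small helper and checked against the result column from the question. The helper name lag_cummax and the output column out are assumptions for illustration. The full column name df$id is spelled out here so that the code does not rely on partial matching of column names.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                 value = c(2, 5, 3, 2, 4, 5, 4),
                 result = c(NA, 2, 5, 5, NA, 4, 5))

# Hypothetical helper: running max of all *previous* elements.
lag_cummax <- function(x) c(NA, head(cummax(x), -1))

# Apply the helper separately within each id.
df$out <- ave(df$value, df$id, FUN = lag_cummax)
print(df)
print(identical(df$out, df$result))
```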

How to create combinations of values of one variable by group using tidyverse in R

I am using the combn function in R to get all the combinations of the values of variable y taking each time 2 values, grouping by the values of x. My expected final result is the tibble c.
But when I try to do it in tidyverse something is (very) wrong.
library(tidyverse)
df <- tibble(x = c(1, 1, 1, 2, 2, 2, 2),
y = c(8, 9, 7, 3, 5, 2, 1))
# This is what I want
a <- combn(df$y[df$x == 1], 2)
a <- rbind(a, rep(1, ncol(a)))
b <- combn(df$y[df$x == 2], 2)
b <- rbind(b, rep(2, ncol(b)))
c <- cbind(a, b)
c <- tibble(c)
c <- t(c)
# but using tidyverse it does not work
df %>% group_by(x) %>% mutate(z = combn(y, 2))
#> Error: Problem with `mutate()` input `z`.
#> x Input `z` can't be recycled to size 3.
#> i Input `z` is `combn(y, 2)`.
#> i Input `z` must be size 3 or 1, not 2.
#> i The error occurred in group 1: x = 1.
Created on 2020-11-18 by the reprex package (v0.3.0)
Try combn together with do:
out = df %>% group_by(x) %>% do(data.frame(t(combn(.$y, 2))))
# A tibble: 9 x 3
# Groups: x [2]
x X1 X2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1
If you have dplyr v1.0.2, you can use group_modify:
df %>% group_by(x) %>% group_modify(~as_tibble(t(combn(.$y, 2L))))
Output
# A tibble: 9 x 3
# Groups: x [2]
x V1 V2
<dbl> <dbl> <dbl>
1 1 8 9
2 1 8 7
3 1 9 7
4 2 3 5
5 2 3 2
6 2 3 1
7 2 5 2
8 2 5 1
9 2 2 1
An option with summarise and unnest
library(dplyr)
library(tidyr)
df %>%
group_by(x) %>%
summarise(y = list(as.data.frame(t(combn(y, 2)))), .groups = 'drop') %>%
unnest(c(y))
# A tibble: 9 x 3
# x V1 V2
# <dbl> <dbl> <dbl>
#1 1 8 9
#2 1 8 7
#3 1 9 7
#4 2 3 5
#5 2 3 2
#6 2 3 1
#7 2 5 2
#8 2 5 1
#9 2 2 1
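The mutate() attempt in the question fails because combn(y, 2) returns a matrix with 2 rows and choose(n, 2) columns, not one value per input row. A base R sketch of the split, combine, and stack pattern that the answers above rely on is shown here; the intermediate names (pairs_by_group, blocks) are illustrative, and the code assumes every group has at least two values.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 2, 2),
                 y = c(8, 9, 7, 3, 5, 2, 1))

# One matrix of pairs per group; combn() returns pairs as columns,
# so t() turns each pair into a row.
pairs_by_group <- lapply(split(df$y, df$x), function(v) t(combn(v, 2)))

# Attach the group key to each block, then stack the blocks.
blocks <- Map(
  function(key, m) data.frame(x = as.numeric(key), V1 = m[, 1], V2 = m[, 2]),
  names(pairs_by_group), pairs_by_group
)
out <- do.call(rbind, blocks)
rownames(out) <- NULL
print(out)
```

Note that combn() treats a single positive number as seq_len() of that number, so a group with only one value would need separate handling.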

Unnest or unchop dataframe containing lists of different lengths

I have a data frame with several list columns that I want to unnest (or unchop). However, the lists have different lengths, so the result is the error Error: No common size for...
Here is a reprex to show what works and doesn't work.
library(tidyr)
library(vctrs)
# This works as expected
df_A <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9)))
)
unchop(df_A, cols = c(A))
# A tibble: 7 x 2
ID A
<int> <dbl>
1 1 9
2 1 8
3 1 5
4 2 7
5 2 6
6 3 6
7 3 9
# This works as expected as the lists are the same lengths
df_AB_1 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2, 3), c(4, 5), c(7, 8)))
)
unchop(df_AB_1, cols = c(A, B))
# A tibble: 7 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 3
4 2 7 4
5 2 6 5
6 3 6 7
7 3 9 8
# This does NOT work as the lists are different lengths
df_AB_2 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0)))
)
unchop(df_AB_2, cols = c(A, B))
# Error: No common size for `A`, size 3, and `B`, size 2.
The output that I would like to achieve for df_AB_2 above is as follows where each list is unchopped and missing values are filled with NA:
# A tibble: 10 x 3
ID A B
<dbl> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
I have referenced this issue on Github and StackOverflow here.
Any ideas how to achieve the result above?
Versions
> packageVersion("tidyr")
[1] ‘1.0.0’
> packageVersion("vctrs")
[1] ‘0.2.0.9001’
Here is an idea using tidyr and dplyr that you can generalise to as many columns as you want:
library(tidyverse)
df_AB_2 %>%
  pivot_longer(c(A, B)) %>%
  # pad every list element with NA up to the longest element overall
  mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  unnest(c(A, B)) %>%
  # drop the padding rows where both A and B are NA
  filter(rowSums(is.na(.[-1])) != 2)
which gives,
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
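The padding step above relies on the replacement function `length<-`, which extends a vector with NA or truncates it. A minimal base R illustration, with made-up variable names:

```r
# Extending a vector in place fills the new positions with NA.
x <- c(7, 6)
length(x) <- 3
print(x)

# The same operation called as a function, as done inside lapply().
padded <- `length<-`(c(1, 2), 4)
print(padded)

# A shorter target length truncates instead of padding.
shortened <- `length<-`(c(9, 8, 5), 2)
print(shortened)
```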
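Define a helper function that pads each list element to a given length, then proceed with dplyr:
# Pad (or truncate) each element of list x to the matching length in len_vec
foo <- function(x, len_vec) {
  lapply(
    seq_along(x),
    function(i) {
      length(x[[i]]) <- len_vec[i]  # extending fills with NA
      x[[i]]
    }
  )
}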
df_AB_2 %>%
mutate(maxl = pmax(lengths(A), lengths(B))) %>%
mutate(A = foo(A, maxl), B = foo(B, maxl)) %>%
unchop(cols = c(A, B)) %>%
select(-maxl)
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Using data.table:
library(data.table)
setDT(df_AB_2)
df_AB_2[, maxl := pmax(lengths(A), lengths(B))]
df_AB_2[, .(A = unlist(A)[seq_len(maxl)], B = unlist(B)[seq_len(maxl)]), by = ID]
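For comparison, here is a base R sketch of the same pad-then-flatten idea using plain lists instead of a tibble. The helper name pad is an assumption, and the data are copied from df_AB_2 in the question.

```r
ID <- 1:3
A <- list(c(9, 8, 5), c(7, 6), c(6, 9))
B <- list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0))

# Per-row target length: the longer of the two list elements.
maxl <- pmax(lengths(A), lengths(B))

# Hypothetical helper: pad each element of x with NA to the lengths in n.
pad <- function(x, n) Map(function(v, k) { length(v) <- k; v }, x, n)

out <- data.frame(
  ID = rep(ID, times = maxl),
  A = unlist(pad(A, maxl)),
  B = unlist(pad(B, maxl))
)
print(out)
```

Because the padding is done per row rather than to one global maximum, no all-NA rows are created and no filtering step is needed afterwards.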

Programmatically rename data frame columns using lookup data frame

What is the best way to batch rename columns using a lookup data frame?
Can I do it as part of a pipe?
library(tidyverse)
df <- tibble(
a = seq(1, 10)
, b = seq(10, 1)
, c = rep(1, 10)
)
df_lookup <- tibble(
old_name = c("b", "c", "a")
, new_name = c("y", "z", "x")
)
I know how to do it manually
df %>%
rename(x = a
, y = b
, z = c)
I am seeking a solution in tidyverse / dplyr packages.
Use rlang: first build a named list of symbols with syms(), then splice it into rename() with the !!! operator (UQS() in older rlang versions):
library(rlang); library(dplyr)
df %>% rename(!!!syms(with(df_lookup, setNames(old_name, new_name))))
# A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
#10 10 1 1
You could write your own helper to make it easier:
rename_to <- function(data, old, new) {
  # match() looks up each selected name in old, regardless of ordering
  data %>% rename_at(old, function(x) new[match(x, old)])
}
df %>% rename_to(df_lookup$old_name, df_lookup$new_name)
In base-R:
names(df)[match(df_lookup$old_name,names(df))] <- df_lookup$new_name
# # A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
# 10 10 1 1
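The base R match() approach can be turned around so that it tolerates lookup entries that are not present in the data and leaves unmatched columns untouched. This is a sketch under assumptions: the function name rename_from_lookup, the extra column d, and the extra lookup row q are all made up for illustration.

```r
# Hypothetical helper: rename columns found in lookup$old_name,
# keep all other column names as they are.
rename_from_lookup <- function(data, lookup) {
  idx <- match(names(data), lookup$old_name)
  names(data) <- ifelse(is.na(idx), names(data), lookup$new_name[idx])
  data
}

df <- data.frame(a = 1:3, b = 3:1, c = 1, d = 0)
df_lookup <- data.frame(
  old_name = c("b", "c", "a", "q"),   # "q" does not exist in df
  new_name = c("y", "z", "x", "w"),
  stringsAsFactors = FALSE
)

renamed <- rename_from_lookup(df, df_lookup)
print(names(renamed))
```

Since the function takes the data as its first argument, it can also be used as a step in a pipe.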
Using data.table:
library(data.table)
setnames(setDT(df), old = df_lookup$old_name, new = df_lookup$new_name)
# x y z
# 1: 1 10 1
# 2: 2 9 1
# 3: 3 8 1
# 4: 4 7 1
# 5: 5 6 1
# 6: 6 5 1
# 7: 7 4 1
# 8: 8 3 1
# 9: 9 2 1
# 10: 10 1 1

Merging dataframe every x row

I am trying to merge values in a dataframe by every nth row.
The data structure looks as follows:
id value
1 1
2 2
3 1
4 2
5 3
6 4
7 1
8 2
9 4
10 4
11 2
12 1
I would like to aggregate the values by their position within every block of 4 rows. The dataset describes one measurement per day over repeated 4-day periods:
id"1" = day1,
id"2" = day2,
id"3" = day3,
id"4" = day4,
id"5" = day1,
...
So perhaps a column that cycles from 1 to 4 could be used?
The result should look like (sums):
day sum
1 8
2 10
3 4
4 5
This can be achieved by using %% to create a grouping variable and then summing with aggregate:
n <- 4
aggregate(value ~cbind(day = (seq_along(df1$id)-1) %% n + 1), df1, FUN = sum)
# day value
#1 1 8
#2 2 10
#3 3 4
#4 4 5
This approach can also be used with dplyr/data.table
library(dplyr)
df1 %>%
group_by(day = (seq_along(id)-1) %% 4 +1) %>%
summarise(value = sum(value))
# day value
# <dbl> <int>
#1 1 8
#2 2 10
#3 3 4
#4 4 5
or
setDT(df1)[, .(value = sum(value)), .(day = (seq_along(id) - 1) %% 4 + 1)]
# day value
#1: 1 8
#2: 2 10
#3: 3 4
#4: 4 5
You need to make a sequence to group by, e.g.
rep(1:4, length.out = nrow(df))
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
In aggregate:
aggregate(value ~ cbind(day = rep(1:4, length.out = nrow(df))), df, FUN = sum)
## day value
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or dplyr:
library(dplyr)
df %>% group_by(day = rep(1:4, length.out = n())) %>% summarise(sum = sum(value))
## # A tibble: 4 x 2
## day sum
## <int> <int>
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or data.table:
library(data.table)
setDT(df)[, .(sum = sum(value)), by = .(day = rep(1:4, length.out = nrow(df)))]
## day sum
## 1: 1 8
## 2: 2 10
## 3: 3 4
## 4: 4 5
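The grouping key used in the answers above can be sketched in base R with tapply(). The data frame is rebuilt from the question's table. The second key, built with %/%, is an assumption added for contrast: it sums consecutive blocks of 4 rows instead of the same day across blocks.

```r
df <- data.frame(id = 1:12,
                 value = c(1, 2, 1, 2, 3, 4, 1, 2, 4, 4, 2, 1))
n <- 4

# Cyclic key 1,2,3,4,1,2,3,4,...: groups the same day across periods.
day <- (seq_len(nrow(df)) - 1) %% n + 1
sum_by_day <- tapply(df$value, day, sum)
print(sum_by_day)

# Block key 1,1,1,1,2,2,2,2,...: groups each consecutive 4-row period.
period <- (seq_len(nrow(df)) - 1) %/% n + 1
sum_by_period <- tapply(df$value, period, sum)
print(sum_by_period)
```

The question asks for the first variant (sums per day); the second is only there to make the difference between %% and %/% explicit.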
