Expanding a data.frame based on (group) values from the data.frame - r

Let's say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And I want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete, map, or nest?

One option is to unnest after creating a sequence with map2:
library(dplyr)
library(purrr)
library(tidyr) # for unnest()
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
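Or, if you want the nest route the question hints at, a rough sketch along the same lines (not part of the original answer; it assumes tidyr >= 1.0 and the packages loaded above):
df1 %>%
nest(range = c(first, last)) %>% # one list-column row per user
mutate(weeks = map(range, ~ seq(.x$first, .x$last))) %>% # build the sequence per user
select(user, weeks) %>%
unnest(weeks)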
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))

One option involving dplyr and tidyr could be:
df %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9

Related

How to expand rows and fill in the numbers between given start and end

I have this data frame:
df <- tibble(x = c(1, 10))
x
<dbl>
1 1
2 10
I want this:
x
<int>
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Unfortunately I can't remember how to approach this. I tried expand.grid, uncount, and runner::fill_run.
Update: The real-world data looks like this, with groups and given start and end numbers. Here are only two groups:
df <- tibble(group = c("A", "A", "B", "B"),
x = c(10,30, 1, 10))
group x
<chr> <dbl>
1 A 10
2 A 30
3 B 1
4 B 10
We may need full_seq with either summarise/reframe or tidyr::complete:
library(dplyr)
library(tidyr) # for full_seq()
df %>%
group_by(group) %>%
reframe(x = full_seq(x, period = 1))
# or with
#tidyr::complete(x = full_seq(x, period = 1))
-output
# A tibble: 31 × 2
group x
<chr> <dbl>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
7 A 16
8 A 17
9 A 18
10 A 19
# … with 21 more rows
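For reference, the tidyr::complete variant mentioned in the code comment above can be written out in full like this (a sketch on the same df):
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
complete(x = full_seq(x, period = 1)) %>%
ungroup()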
A simple base R variation, building the sequences directly:
group <- c(rep("A", 21), rep("B", 10))
x <- c(10:30, 1:10)
df <- tibble(group, x)
df
# A tibble: 31 × 2
group x
<chr> <int>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
And here's an expand.grid solution:
g1 <- expand.grid(group = "A", x = 10:30)
g2 <- expand.grid(group = "B", x = 1:10)
df <- rbind(g1, g2)
df
group x
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
7 A 16
Using base:
stack(sapply(split(df$x, df$group), function(i) seq(i[ 1 ], i[ 2 ])))
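If a group could contain more than two rows (an assumption beyond the example data), taking the range keeps the same one-liner robust:
stack(lapply(split(df$x, df$group), function(i) seq(min(i), max(i))))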

Create new column based on previous column by group; if missing, use NA

I am trying to select a value by group from one column and pass it as a value into another column, extending it across the whole group. This is similar to the question asked here. But some groups do not have this value; in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass into `val_a` the value of group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3
You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA
Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also guards against the possibility of more than one match, which could otherwise produce bad results (with or without a warning).
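To see the difference, consider a hypothetical extra group (not from the question) in which 'a' appears twice:
dd4 <- data.frame(type = c(4, 4, 4),
grp = c('a', 'a', 'b'),
val = c(5, 6, 7))
rbind(dd, dd4) %>%
group_by(type) %>%
mutate(val_a2 = val[which(grp == 'a')[1]])
# type 4 gets val_a2 = 5 (the first match) on every row; val[grp == 'a']
# alone would return two values there, which mutate cannot fit to a size-3 group.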

dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe

I am struggling with a perhaps easy question. I have a dataframe of 1 column with n rows (n is a multiple of 3). I would like to add a second column with integers like 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,... How can I achieve this with dplyr as a general solution for different numbers of rows (all multiples of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works, but I would like a solution for n rows, with each = 3. Many thanks!
You can specify the each and length.out parameters in rep:
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
Integer division, i.e. %/% 3, over the sequence 0, 1, 2, ... gives 0, 0, 0, 1, 1, 1, ...; adding 1 then produces the desired sequence, so this will also do:
df %>% mutate(col2 = 1 + (row_number() - 1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
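Equivalently (a small variation not shown above), ceiling over the row number produces the same blocks of three:
df %>% mutate(col2 = ceiling(row_number() / 3))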

Using pivot_longer with existing names_to column

Take an example dataframe like so (the real dataframe has more columns):
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
I can use pivot_longer to collect my columns of interest (A and B) like so:
library(dplyr)
library(tidyr)
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
df
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
Now let's say I have another column C in my dataframe, making it no longer tidy
C <- seq(7, 12, 1)
df_2 <- data.frame(df, C)
df_2
Letter Number C
1 A 1 7
2 B 4 8
3 A 2 9
4 B 5 10
5 A 3 11
6 B 6 12
I want to use pivot_longer again to make df_2 tidy and get this output:
data.frame(Letter = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Number = seq(1, 12, 1))
Letter Number
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Using the same strategy creates an error though:
df_2 %>%
pivot_longer(cols = "C", names_to = "Letter", values_to = "Number")
Error: Failed to create output due to bad names.
* Choose another strategy with `names_repair`
Setting names_repair to minimal runs but doesn't produce the output I want.
You can do it like this:
library(tidyverse)
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
C <- seq(7, 12, 1)
df_2 <- data.frame(C)
df_2 <- df_2 %>% pivot_longer(cols = C, names_to = "Letter", values_to = "Number")
df_result <- rbind(df, df_2)
Output
> df_result
# A tibble: 12 x 2
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Maybe try this if it is helpful:
library(tidyverse)
#Code
df_2 %>% pivot_longer(everything()) %>%
arrange(name) %>% group_by(name) %>%
filter(!duplicated(value))
Output:
# A tibble: 12 x 2
# Groups: name [3]
name value
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
We could do this easily with stack
library(dplyr)
library(purrr) # for set_names()
stack(df_2)[2:1] %>%
distinct %>%
set_names(c("Letter", "Number"))
-output
# Letter Number
#1 A 1
#2 A 2
#3 A 3
#4 B 4
#5 B 5
#6 B 6
#7 C 7
#8 C 8
#9 C 9
#10 C 10
#11 C 11
#12 C 12
Or an option with unnest/enframe
library(tidyr)
library(tibble)
unclass(df_2) %>%
enframe(name = "Letter", value = "Number") %>%
unnest(c(Number)) %>%
distinct
Or using melt
library(reshape2)
melt(df_2) %>%
distinct()
Or in a single line in base R
unique(stack(df_2)[2:1])

Label distinct combinations across multiple columns in R

I want to create a new column that labels each unique combination of values across x, y, z columns. My current work-around to achieve that is this:
> library(tidyverse)
>
> set.seed(100)
> df = tibble(x = sample.int(5, 50, replace = T), y = sample.int(5, 50, replace = T), z = sample.int(5, 50, replace = T))
> df
# A tibble: 50 x 3
x y z
<int> <int> <int>
1 2 4 4
2 3 4 4
3 1 3 5
4 2 1 4
5 4 2 5
6 4 5 2
7 2 3 4
8 3 5 4
9 2 4 1
10 5 5 2
# … with 40 more rows
>
> df2 = df %>% distinct(x,y,z) %>% rowid_to_column("unique_id") %>% left_join(df)
Joining, by = c("x", "y", "z")
> df2
# A tibble: 50 x 4
unique_id x y z
<int> <int> <int> <int>
1 1 2 4 4
2 2 3 4 4
3 3 1 3 5
4 4 2 1 4
5 4 2 1 4
6 5 4 2 5
7 5 4 2 5
8 6 4 5 2
9 6 4 5 2
10 7 2 3 4
# … with 40 more rows
What is a better/more efficient way to do this on a fairly large dataset? I'd like to stay within the tidyverse but am also open to other suggestions.
You could use rleidv from data.table. Note that rleid is run-length based, so identical combinations only share an id when they appear in adjacent rows; if duplicates can occur anywhere in the data, the grouping approaches below are safer.
df$unique_id <- data.table::rleidv(df)
In dplyr, we can use the group_indices function for this purpose, which generates a unique id for each group of values.
library(dplyr)
df %>% mutate(unique_id = group_indices(., x, y, z))
In the devel version of dplyr (released as dplyr 1.0.0), we can use cur_group_id
library(dplyr)
df %>%
group_by_all() %>%
mutate(unique_id = cur_group_id())
Or using .GRP from data.table
library(data.table)
setDT(df)[, unique_id := .GRP, names(df)]
