Restructuring multiple columns in R

Here is a sample of my data:
dat<-read.table(text=" id bx1 Z1A Z1B Z1C QR1 bx2 Z2A Z2B Z2C QR2
1 1 1 2 3 C 18 2 2 1 E
2 11 2 3 3 B 14 3 3 3 A
",header=TRUE)
I want to get the following table:
id bx Z QR Score
1 1 Z1A C 1
1 1 Z1B C 2
1 1 Z1C C 3
1 18 Z2A E 2
1 18 Z2B E 2
1 18 Z2C E 1
2 11 Z1A B 2
2 11 Z1B B 3
2 11 Z1C B 3
2 14 Z2A A 3
2 14 Z2B A 3
2 14 Z2C A 3
Assume that I have more bx and Z columns than shown here. I tried the following, but it does not work. I would like to do it with tidyverse or other packages, but I was unable to find a solution.
df1<-melt(dat, id.var= "id")
Thanks for your help

In this case, we can pivot the Z columns and the remaining columns separately with pivot_longer and then combine them with a left_join:
library(dplyr)
library(tidyr)
library(stringr)
dat %>%
select(id, starts_with('Z')) %>%
pivot_longer(cols = starts_with('Z'), values_to = 'Score',
names_to = 'Z') %>%
group_by(id) %>%
mutate(group = as.character(as.integer(factor(str_remove(Z, "[A-Z]$"))))) %>%
left_join(dat %>%
select(id, matches('^[^Z]')) %>%
pivot_longer(cols = -id, names_to = c(".value", "group"),
names_pattern = "^([A-Za-z]+)([0-9]+)")) %>%
select(-group)
# A tibble: 12 x 5
# Groups: id [2]
# id Z Score bx QR
# <int> <chr> <int> <int> <fct>
# 1 1 Z1A 1 1 C
# 2 1 Z1B 2 1 C
# 3 1 Z1C 3 1 C
# 4 1 Z2A 2 18 E
# 5 1 Z2B 2 18 E
# 6 1 Z2C 1 18 E
# 7 2 Z1A 2 11 B
# 8 2 Z1B 3 11 B
# 9 2 Z1C 3 11 B
#10 2 Z2A 3 14 A
#11 2 Z2B 3 14 A
#12 2 Z2C 3 14 A
Another option is to do a single pivot_longer and then fill the bx and QR columns:
dat %>%
pivot_longer(cols = -id, names_to = c(".value", "group"),
names_pattern = "^([A-Za-z]+)([0-9]+[A-Z]?)") %>%
group_by(id) %>%
fill(bx, QR) %>%
ungroup %>%
filter(!is.na(Z)) %>%
rename_at(vars(Z, group), ~ c('Score', 'Z')) %>%
mutate(Z = str_c('Z', Z))
# A tibble: 12 x 5
# id Z bx Score QR
# <int> <chr> <int> <int> <fct>
# 1 1 Z1A 1 1 C
# 2 1 Z1B 1 2 C
# 3 1 Z1C 1 3 C
# 4 1 Z2A 18 2 E
# 5 1 Z2B 18 2 E
# 6 1 Z2C 18 1 E
# 7 2 Z1A 11 2 B
# 8 2 Z1B 11 3 B
# 9 2 Z1C 11 3 B
#10 2 Z2A 14 3 A
#11 2 Z2B 14 3 A
#12 2 Z2C 14 3 A
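A third variant, offered only as a sketch and not as part of the original answers: pivot the Z columns first, then pivot the bx/QR columns with ".value", and keep the rows where the block number embedded in the Z name equals the block number of the bx/QR pair. The helper column name grp is made up for illustration, and the sketch assumes every column follows the naming scheme in the sample data (letters, a block number, and for Z columns a trailing letter).

```r
library(dplyr)
library(tidyr)

dat <- read.table(text = " id bx1 Z1A Z1B Z1C QR1 bx2 Z2A Z2B Z2C QR2
1 1 1 2 3 C 18 2 2 1 E
2 11 2 3 3 B 14 3 3 3 A
", header = TRUE)

res <- dat %>%
  # one row per id and Z column; the score goes into 'Score'
  pivot_longer(starts_with("Z"), names_to = "Z", values_to = "Score") %>%
  # one row per block number; 'bx' and 'QR' become regular columns
  pivot_longer(matches("^(bx|QR)[0-9]+$"),
               names_to = c(".value", "grp"),
               names_pattern = "^([A-Za-z]+)([0-9]+)$") %>%
  # keep only the bx/QR block that belongs to this Z column
  filter(grp == gsub("[^0-9]", "", Z)) %>%
  select(id, bx, Z, QR, Score)

res
```

The second pivot temporarily doubles the number of rows (one copy per block), so for very wide data the fill-based version above may be cheaper.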

Related

how to write a for loop to combine several data frame that are made with forward pipe operator in R?

I need to make a new data frame, but I don't know how to use a for loop to reduce the repetition.
This is my original data frame
ID t1 t2 t4 t5 t6
1 S B 11 1 1
1 S B 11 2 0
1 S B 12 3 1
1 S B 12 4 1
1 S B 13 5 0
1 S B 14 6 1
1 S B 14 7 1
1 S B 15 8 0
2 S B 11 1 1
2 S B 12 2 1
2 S B 13 3 1
2 S B 14 4 0
2 S B 15 5 1
3 S G 11 1 1
3 S G 12 2 1
3 S G 12 3 0
3 S G 13 4 0
3 S G 14 5 1
3 S G 15 6 1
4 S G 11 1 1
4 S G 12 2 0
4 S G 13 3 0
4 S G 14 4 1
4 S G 15 5 0
5 N B 11 1 1
5 N B 12 2 1
5 N B 13 3 1
6 N B 11 1 1
6 N B 12 2 1
6 N B 13 3 1
6 N B 13 4 1
6 N B 14 5 0
6 N B 15 6 1
7 N G 11 1 0
7 N G 12 2 1
8 N G 11 1 0
8 N G 11 2 1
8 N G 11 3 0
8 N G 12 4 1
8 N G 12 5 0
8 N G 13 6 1
8 N G 13 7 1
8 N G 13 8 1
8 N G 14 9 1
8 N G 14 10 0
8 N G 15 11 1
8 N G 15 12 1
8 N G 15 13 0
8 N G 15 14 0
The following is the code I have written to extract my new data frames:
t=levels(as.factor(df$t4))
df11<- df %>%
filter(t4==11) %>%
group_by(ID) %>%
mutate(num=seq_along(ID)) %>%
as.data.frame
df.11.new<- df11 %>%
group_by(t2, num) %>%
summarise(mean=mean(t6), count=n())
df.11.new$t7="d11"
I need to repeat this code for all the levels of t4, which are "11", "12", "13", "14" and "15"
and finally combine them all like the following code:
df.all<-rbind(df.11.new, df.12.new, df.13.new, df.14.new, df.15.new)
But I don't know how to write the for loop.
Instead of filtering, add 't4' as a grouping variable. Then we don't need to filter repeatedly in a loop and rbind the outputs:
library(stringr)
library(dplyr)
df.all <- df %>%
group_by(ID, t4) %>%
mutate(num = row_number()) %>%
group_by(t4, t2, num) %>%
summarise(mean = mean(t6), count = n(),
t7 = str_c('d', first(t4)), .groups = 'drop')
Checking against the OP's output for t4 = 11:
> df.all %>%
filter(t4 == 11)
# A tibble: 5 × 6
t4 t2 num mean count t7
<int> <chr> <int> <dbl> <int> <chr>
1 11 B 1 1 4 d11
2 11 B 2 0 1 d11
3 11 G 1 0.5 4 d11
4 11 G 2 1 1 d11
5 11 G 3 0 1 d11
> df.11.new
# A tibble: 5 × 4
# Groups: t2 [2]
t2 num mean count
<chr> <int> <dbl> <int>
1 B 1 1 4
2 B 2 0 1
3 G 1 0.5 4
4 G 2 1 1
5 G 3 0 1
If we use rowid from data.table, we can remove the first grouping step:
library(data.table)
df %>%
group_by(t4, t2, num = rowid(ID, t4)) %>%
summarise(mean = mean(t6), count = n(),
t7 = str_c('d', first(t4)), .groups = 'drop')
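For comparison, the explicit loop the question asks about could look roughly like the sketch below. The small data frame is made up (it only reuses the column names from the question), and the names pieces and lev are illustrative. The grouped versions above avoid this bookkeeping entirely.

```r
library(dplyr)

# Made-up data with the same column names as in the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3),
  t2 = c("B", "B", "B", "B", "B", "G"),
  t4 = c(11, 11, 12, 11, 12, 11),
  t6 = c(1, 0, 1, 1, 1, 0)
)

pieces <- list()
for (lev in sort(unique(df$t4))) {
  pieces[[as.character(lev)]] <- df %>%
    filter(t4 == lev) %>%                 # one level of t4 at a time
    group_by(ID) %>%
    mutate(num = row_number()) %>%        # running index within each ID
    group_by(t2, num) %>%
    summarise(mean = mean(t6), count = n(), .groups = "drop") %>%
    mutate(t7 = paste0("d", lev))         # label the piece, e.g. "d11"
}

# stack the per-level results into one data frame
df.all <- bind_rows(pieces)
df.all
```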

Rank subgroup by group (dplyr)

This question addresses how to assign the rank of a row within a group. I would like to assign the rank of a subgroup to a row within that subgroup. What I'm really getting at is that I need an abbreviation of the second group_by variable that is guaranteed to be unique, and this is the best way I can think of to go about doing that. Hopefully the desired output below makes this clear enough.
Input dataframe:
my_df <- tibble(
var1 = c(rep("A", 8), rep("B", 12)),
var2 = c(rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_z", 4))
)
Desired output:
# A tibble: 20 x 3
var1 var2 group_rank
<chr> <chr> <dbl>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
How may I assign group_rank as above, ideally (but not necessarily) using a tidyverse approach?
We could use match after grouping
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = match(var2, unique(var2))) %>%
ungroup
-output
# A tibble: 20 x 3
var1 var2 group_rank
<chr> <chr> <int>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
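One detail worth checking on your own data: match(var2, unique(var2)) numbers the subgroups in order of first appearance within each var1. If a rank in sorted order is wanted instead, dplyr's dense_rank() is one possibility. The toy tibble below is made up purely to show where the two differ.

```r
library(dplyr)

# Made-up data in which "y" appears before "x" inside group A
toy <- tibble(
  var1 = c("A", "A", "A", "B", "B"),
  var2 = c("y", "x", "y", "x", "x")
)

res <- toy %>%
  group_by(var1) %>%
  mutate(
    by_appearance = match(var2, unique(var2)),  # "y" comes first, so it gets 1
    by_sort_order = dense_rank(var2)            # "x" sorts first, so it gets 1
  ) %>%
  ungroup()

res
```

In the question's data the two numberings coincide, because the var2 values already appear in sorted order within each var1.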
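Using an approach in the spirit of the one by the respected @akrun, with data.table::rleid(). Note that rleid() numbers consecutive runs, so this relies on equal var2 values being adjacent within each var1 group: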
library(tidyverse)
my_df <- tibble(
var1 = c(rep("A", 8), rep("B", 12)),
var2 = c(rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_x", 4),
rep("long_string_y", 4),
rep("long_string_z", 4))
)
my_df %>%
group_by(var1) %>%
mutate(res = data.table::rleid(var2))
#> # A tibble: 20 x 3
#> # Groups: var1 [2]
#> var1 var2 res
#> <chr> <chr> <int>
#> 1 A long_string_x 1
#> 2 A long_string_x 1
#> 3 A long_string_x 1
#> 4 A long_string_x 1
#> 5 A long_string_y 2
#> 6 A long_string_y 2
#> 7 A long_string_y 2
#> 8 A long_string_y 2
#> 9 B long_string_x 1
#> 10 B long_string_x 1
#> 11 B long_string_x 1
#> 12 B long_string_x 1
#> 13 B long_string_y 2
#> 14 B long_string_y 2
#> 15 B long_string_y 2
#> 16 B long_string_y 2
#> 17 B long_string_z 3
#> 18 B long_string_z 3
#> 19 B long_string_z 3
#> 20 B long_string_z 3
Created on 2021-07-12 by the reprex package (v2.0.0)
Update:
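As Greg pointed out (see comments), group_by() defaults to .add = FALSE, so in the first answer the second group_by() replaces the grouping by var1 instead of adding to it. If the intention is to group by both variables, .add = TRUE has to be supplied. Be aware that cur_group_id() then numbers the var1/var2 combinations (1 to 5 for this data) rather than restarting within each var1: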
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = cur_group_id()) %>%
group_by(var2, .add=TRUE) %>%
mutate(group_rank = cur_group_id())
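But in this case, as Greg pointed out, the following is enough. It numbers the distinct values of var2 across the whole data frame, which happens to match the desired output for this example: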
my_df %>% group_by(var2) %>% mutate(group_rank = cur_group_id())
First answer:
We could use cur_group_id() twice:
library(dplyr)
my_df %>%
group_by(var1) %>%
mutate(group_rank = cur_group_id()) %>%
group_by(var2) %>%
mutate(group_rank = cur_group_id())
Output:
var1 var2 group_rank
<chr> <chr> <int>
1 A long_string_x 1
2 A long_string_x 1
3 A long_string_x 1
4 A long_string_x 1
5 A long_string_y 2
6 A long_string_y 2
7 A long_string_y 2
8 A long_string_y 2
9 B long_string_x 1
10 B long_string_x 1
11 B long_string_x 1
12 B long_string_x 1
13 B long_string_y 2
14 B long_string_y 2
15 B long_string_y 2
16 B long_string_y 2
17 B long_string_z 3
18 B long_string_z 3
19 B long_string_z 3
20 B long_string_z 3
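The effect of .add discussed in the update can be inspected directly with group_vars(). The following is a small sketch on a shortened, made-up version of my_df (one row per var1/var2 combination); the object names replaced and added are illustrative.

```r
library(dplyr)

# Shortened stand-in for my_df: one row per var1/var2 combination
my_df <- tibble(
  var1 = c("A", "A", "B", "B", "B"),
  var2 = c("x", "y", "x", "y", "z")
)

# Default .add = FALSE: the second group_by() overrides the first
replaced <- my_df %>% group_by(var1) %>% group_by(var2)

# .add = TRUE: the second group_by() adds to the existing grouping
added <- my_df %>% group_by(var1) %>% group_by(var2, .add = TRUE)

group_vars(replaced)
group_vars(added)

# With both grouping variables, cur_group_id() numbers the combinations
ids <- added %>% mutate(id = cur_group_id()) %>% pull(id)
ids
```

So for a rank that restarts within each var1, a per-group computation such as match(var2, unique(var2)) is closer to what the question asks for than cur_group_id().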

Split information from two columns, R, tidyverse

I've got some data in two columns:
# A tibble: 16 x 2
code niveau
<chr> <dbl>
1 A 1
2 1 2
3 2 2
4 3 2
5 4 2
6 5 2
7 B 1
8 6 2
9 7 2
My desired output is:
# A tibble: 16 x 3
code niveau cat
<chr> <dbl> <chr>
1 A 1 A
2 1 2 A
3 2 2 A
4 3 2 A
5 4 2 A
6 5 2 A
7 B 1 B
8 6 2 B
Is there a tidy way to convert these data without looping through them?
Here some dummy data:
data<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2))
desired_output<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2),
cat=c(rep('A', 6),rep('B', 5), rep('C', 5)))
Nicolas
Probably, you can create a new column cat and replace code values with NA where there is a number. We can then use fill to replace missing values with previous non-NA value.
library(dplyr)
data %>% mutate(cat = replace(code, grepl('\\d', code), NA)) %>% tidyr::fill(cat)
# A tibble: 16 x 3
# code niveau cat
# <chr> <dbl> <chr>
# 1 A 1 A
# 2 1 2 A
# 3 2 2 A
# 4 3 2 A
# 5 4 2 A
# 6 5 2 A
# 7 B 1 B
# 8 6 2 B
# 9 7 2 B
#10 8 2 B
#11 9 2 B
#12 C 1 C
#13 10 2 C
#14 11 2 C
#15 12 2 C
#16 13 2 C
We can use str_detect from stringr
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(cat = replace(code, str_detect(code, '\\d'), NA)) %>%
fill(cat)
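A different technique, shown only as a sketch: instead of testing code for digits, use niveau to mark where each block starts. This assumes (beyond what the question states explicitly) that every category row has niveau == 1 and that the data start with such a row. The data below are a shortened version of the dummy data, with code written as character.

```r
library(dplyr)

# Shortened version of the dummy data from the question
data <- tibble(
  code   = c("A", "1", "2", "B", "6", "7", "C", "10"),
  niveau = c(1, 2, 2, 1, 2, 2, 1, 2)
)

res <- data %>%
  mutate(
    # cumsum() counts the header rows seen so far (1, 1, 1, 2, ...);
    # that count indexes into the vector of header codes
    cat = code[niveau == 1][cumsum(niveau == 1)]
  )

res
```

This variant would also work if a category label happened to contain a digit, which the digit-based replace() approach would turn into NA.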

Variable/column selection in tidyr fill()

Suppose a df with some missing values like this:
ID col_A_1 col_A_2 col_B_1 col_B_2
1 1 1 NA NA a
2 1 2 NA 1 b
3 1 3 1 2 c
4 1 4 2 3 d
5 1 NA 3 4 e
6 2 NA 1 5 f
7 2 NA 2 6 g
8 2 1 3 7 h
9 2 2 4 8 <NA>
10 2 3 5 NA <NA>
I want to fill the missing values using tidyr fill(), however, only the missing values in columns containing A.
I was able to achieve it using:
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
fill(names(.)[grepl("A", names(.))], .direction = "up") %>%
fill(names(.)[grepl("A", names(.))], .direction = "down") %>%
ungroup()
ID col_A_1 col_A_2 col_B_1 col_B_2
<dbl> <int> <int> <int> <chr>
1 1 1 1 NA a
2 1 2 1 1 b
3 1 3 1 2 c
4 1 4 2 3 d
5 1 4 3 4 e
6 2 1 1 5 f
7 2 1 2 6 g
8 2 1 3 7 h
9 2 2 4 8 <NA>
10 2 3 5 NA <NA>
However, I'm looking for other variable/column selection possibilities inside tidyr fill().
Sample data:
df <- data.frame(ID = c(rep(1, 5), rep(2, 5)),
col_A_1 = c(1:4, NA, NA, NA, 1:3),
col_A_2 = c(NA, NA, 1:3, 1:5),
col_B_1 = c(NA, 1:8, NA),
col_B_2 = c(letters[1:8], NA, NA),
stringsAsFactors = FALSE)
fill() accepts the tidyselect helpers (matches(), contains(), starts_with(), and so on):
library(tidyverse)
df %>%
group_by(ID) %>%
fill(matches('A'), .direction = 'up') %>%
fill(matches('A'), .direction = 'down')
# A tibble: 10 x 5
# Groups: ID [2]
# ID col_A_1 col_A_2 col_B_1 col_B_2
# <dbl> <int> <int> <int> <chr>
# 1 1 1 1 NA a
# 2 1 2 1 1 b
# 3 1 3 1 2 c
# 4 1 4 2 3 d
# 5 1 4 3 4 e
# 6 2 1 1 5 f
# 7 2 1 2 6 g
# 8 2 1 3 7 h
# 9 2 2 4 8 <NA>
#10 2 3 5 NA <NA>
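The two fill() calls can probably be merged: tidyr 1.0.0 and later accept .direction = "updown", which fills up first and then down. A sketch on the sample data from the question, using contains() as one more selection helper (the object name res is illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(rep(1, 5), rep(2, 5)),
                 col_A_1 = c(1:4, NA, NA, NA, 1:3),
                 col_A_2 = c(NA, NA, 1:3, 1:5),
                 col_B_1 = c(NA, 1:8, NA),
                 col_B_2 = c(letters[1:8], NA, NA),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(ID) %>%
  # fill only the "_A_" columns: upwards first, then downwards
  fill(contains("_A_"), .direction = "updown") %>%
  ungroup()

res
```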

How to group_by(everything())

I want to count unique combinations in a dataframe using dplyr
I tried the following:
require(dplyr)
set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = T),
b = sample(1:2, 100, replace = T),
c = sample(1:2, 100, replace = T))
dat %>% group_by(a,b,c) %>% summarise(n = n())
But to make this generic (unrelated to the names of the columns) I tried:
dat %>% group_by(everything()) %>% summarise(n = n())
The first call, with group_by(a,b,c), results in:
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
The group_by(everything()) call, however, gives the error
Error in mutate_impl(.data, dots) : `c(...)` must be a character vector
I fiddled around with different things but cannot get it to work. I know I could use names(dat), but the columns in the dataframe that need to be in the group_by() depend on previous steps in the dplyr chain.
There is a function called group_by_all() (and, in the same sense, group_by_at() and group_by_if()) which does exactly that.
library(dplyr)
dat %>%
group_by_all() %>%
summarise(n = n())
which gives the same result,
# A tibble: 12 x 4
# Groups: a, b [?]
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
PS
packageVersion('dplyr')
#[1] ‘0.7.2’
We can use .dots
dat %>%
group_by(.dots = names(.)) %>%
summarise(n = n())
# A tibble: 12 x 4
# Groups: a, b [?]
# a b c n
# <int> <int> <int> <int>
#1 1 1 1 6
#2 1 1 2 8
#3 1 2 1 13
#4 1 2 2 8
#5 2 1 1 7
#6 2 1 2 12
#7 2 2 1 14
#8 2 2 2 10
#9 3 1 1 3
#10 3 1 2 4
#11 3 2 1 7
#12 3 2 2 8
Another option would be to unquote-splice (!!!) symbols created with rlang::syms
dat %>%
group_by(!!! rlang::syms(names(.))) %>%
summarise(n = n())
In dplyr version 1.0.0 and later, you would now use across().
library(dplyr)
dat %>%
group_by(across(everything())) %>%
summarise(n = n())
Package version:
> packageVersion("dplyr")
[1] ‘1.0.5’
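Since the goal is just to count unique combinations, count() with the same across(everything()) selection is a possible shortcut in dplyr 1.0.0 and later. This is a sketch, not part of the original answers; the names via_group and via_count are illustrative.

```r
library(dplyr)

set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = TRUE),
                  b = sample(1:2, 100, replace = TRUE),
                  c = sample(1:2, 100, replace = TRUE))

# group by every column, then count rows per group
via_group <- dat %>%
  group_by(across(everything())) %>%
  summarise(n = n(), .groups = "drop")

# count() does the grouping and the tally in one step
via_count <- dat %>%
  count(across(everything()))

via_count
```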
