Using pivot_longer in tidyr with a complex separator [duplicate] - r

This question already has an answer here:
How to use Pivot_longer to reshape from wide-type data to long-type data with multiple variables
(1 answer)
Closed 2 years ago.
In a previous post here I tried to get the equivalent of an rbind using tidyr::pivot_longer(). This is the data and the solution.
set.seed(1)
df1 <- data.frame(group = rep(letters[1:2],each=3),
day = rep(1:3,2),
var1_mean = round(rnorm(6),2),
var1_sd = round(rnorm(6,5),2),
var2_mean = round(rnorm(6),2),
var2_sd = round(rnorm(6,5),2))
# group day var1_mean var1_sd var2_mean var2_sd
# 1 a 1 -0.63 5.49 -0.62 5.82
# 2 a 2 0.18 5.74 -2.21 5.59
# 3 a 3 -0.84 5.58 1.12 5.92
# 4 b 1 1.60 4.69 -0.04 5.78
# 5 b 2 0.33 6.51 -0.02 5.07
# 6 b 3 -0.82 5.39 0.94 3.01
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with('var'),
               names_to = c('grp', '.value'),
               names_sep = "_")
# group day grp mean sd
# <fct> <int> <chr> <dbl> <dbl>
# 1 a 1 var1 -0.63 5.49
# 2 a 1 var2 -0.62 5.82
# 3 a 2 var1 0.18 5.74
# 4 a 2 var2 -2.21 5.59
# 5 a 3 var1 -0.84 5.58
# 6 a 3 var2 1.12 5.92
# 7 b 1 var1 1.6 4.69
# 8 b 1 var2 -0.04 5.78
# 9 b 2 var1 0.33 6.51
# 10 b 2 var2 -0.02 5.07
# 11 b 3 var1 -0.82 5.39
# 12 b 3 var2 0.94 3.01
This solution is quite contingent on the naming convention used for the mean and sd variables. If there is a different naming convention, with a more complex separator between the two important nodes of the column names, like so...
df2 <- data.frame(group = rep(letters[1:2],each=3),
day = rep(1:3,2),
mean_var_1 = round(rnorm(6),2),
sd_var_1 = round(rnorm(6,5),2),
mean_var_2 = round(rnorm(6),2),
sd_var_2 = round(rnorm(6,5),2))
df2
# group day mean_var_1 sd_var_1 mean_var_2 sd_var_2
# 1 a 1 0.62 6.36 -0.39 5.70
# 2 a 2 -0.06 4.90 -0.06 5.56
# 3 a 3 -0.16 5.39 1.10 4.31
# 4 b 1 -1.47 4.95 0.76 4.29
# 5 b 2 -0.48 3.62 -0.16 5.36
# 6 b 3 0.42 4.59 -0.25 5.77
How would I achieve a similar result to the first example, with a single mean and sd column and with var_1 and var_2 as the grouping variable?

If the column names are more complicated, you can use the names_pattern argument: a regular expression whose capture groups determine which part of each column name goes to which entry of names_to.
tidyr::pivot_longer(df2,
cols = contains('var'),
names_to = c('.value', 'grp'),
names_pattern = '(.*?)_(.*)')
# group day grp mean sd
# <chr> <int> <chr> <dbl> <dbl>
# 1 a 1 var_1 0.62 6.36
# 2 a 1 var_2 -0.39 5.7
# 3 a 2 var_1 -0.06 4.9
# 4 a 2 var_2 -0.06 5.56
# 5 a 3 var_1 -0.16 5.39
# 6 a 3 var_2 1.1 4.31
# 7 b 1 var_1 -1.47 4.95
# 8 b 1 var_2 0.76 4.29
# 9 b 2 var_1 -0.48 3.62
#10 b 2 var_2 -0.16 5.36
#11 b 3 var_1 0.42 4.59
#12 b 3 var_2 -0.25 5.77
'(.*?)_(.*)' defines two capture groups. The first, (.*?), is a lazy match that takes everything up to the first underscore in the column name (mean or sd) and is mapped to .value. The second, (.*), takes everything after that underscore (var_1 or var_2) and is mapped to grp.
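To see how the two capture groups split the column names, the pattern can be tried on the names alone. This is a minimal sketch in base R, not what pivot_longer does internally: the ^ and $ anchors and the variable names nms, value_part and grp_part are added here for illustration, and perl = TRUE is used so the lazy quantifier behaves as described.

```r
# Column names from df2 in the question
nms <- c("mean_var_1", "sd_var_1", "mean_var_2", "sd_var_2")

# Same two capture groups as in names_pattern, anchored for use with sub()
pattern <- "^(.*?)_(.*)$"

# Group 1: everything before the first underscore (becomes the .value columns)
value_part <- sub(pattern, "\\1", nms, perl = TRUE)

# Group 2: everything after the first underscore (becomes the grp column)
grp_part <- sub(pattern, "\\2", nms, perl = TRUE)

data.frame(name = nms, value = value_part, grp = grp_part)
```

Because the first group is lazy, it stops at the first underscore, so the later underscore in var_1 stays inside the second group.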

Related

Converting a list of named list to data frame

I have a list returned by sapply which looks like this:
> my_list
[,1] [,2] [,3] [,4]
val 1.73 2.73 4.71 5.27
cost 10.1 8.71 9.95 0.01
time 5.36 5.84 5.68 2.10
I'd like to convert it into a data frame:
id val cost time
1 1.73 10.1 5.36
2 2.73 8.71 5.84
3 4.71 9.95 5.68
4 5.27 0.01 2.10
How can I transform the list into the data frame this way?
Edit: Here is the output of dput(my_list):
structure(list(1.73, 10.1, 5.36,2.73,8.71,5.84,
4.71,9.95,5.68, 5.27, 0.01, 2.10),
dim = c(3L, 4L), dimnames = list(c("val",
"cost", "time"), NULL))
Use t() to transpose, then add the id column:
t(my_list) |>
  transform(id = seq(ncol(my_list)))
val cost time id
1 1.73 10.1 5.36 1
2 2.73 8.71 5.84 2
3 4.71 9.95 5.68 3
4 5.27 0.01 2.1 4
The elements of the data are list elements. To turn them into regular vectors, one option is unnest:
library(dplyr)
library(tidyr)
t(my_list) %>%
as_tibble %>%
unnest(where(is.list)) %>%
mutate(id = row_number(), .before = 1)
Output:
# A tibble: 4 × 4
id val cost time
<int> <dbl> <dbl> <dbl>
1 1 1.73 10.1 5.36
2 2 2.73 8.71 5.84
3 3 4.71 9.95 5.68
4 4 5.27 0.01 2.1
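The same conversion can also be sketched in base R by flattening the list elements first. This assumes every cell holds a single number, as in the dput output above; num_mat and res are names chosen here for illustration.

```r
# The list-matrix from the question's dput() output
my_list <- structure(list(1.73, 10.1, 5.36, 2.73, 8.71, 5.84,
                          4.71, 9.95, 5.68, 5.27, 0.01, 2.10),
                     dim = c(3L, 4L),
                     dimnames = list(c("val", "cost", "time"), NULL))

# unlist() flattens the cells column by column, so rebuilding a matrix
# with the same number of rows keeps every value in its place
num_mat <- matrix(unlist(my_list), nrow = nrow(my_list),
                  dimnames = dimnames(my_list))

# Transpose so val/cost/time become columns, then add the id column
res <- data.frame(id = seq_len(ncol(num_mat)), t(num_mat))
res
```

Here the columns of res are plain numeric vectors rather than list columns, so no unnesting step is needed.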

Method in R to find difference between rows with varying row spacing

I want to add an extra column in a dataframe which displays the difference between certain rows, where the distance between the rows also depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only a single integer can be used as input. How would you find the difference between rows when the distance between them varies?
I am trying to get the output in Col_new, which is the difference between rows i and i+n, where n is the value in column Count. (The data is rounded, so there might be 0.01 discrepancies in Col_new.)
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but can undoubtedly be made more efficient.
library(dplyr)
df %>%
mutate(col_2 = sapply(1:4, function(s){lead(Col_1, n = s)})) %>%
rowwise() %>%
mutate(Col_new = Col_1 - col_2[Count]) %>%
select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
library(dplyr)
# Column names follow the Data block above (Col_1, Count)
df %>% mutate(Col_new = case_when(
  Count == 1 ~ Col_1 - lead(Col_1, n = 1),
  Count == 2 ~ Col_1 - lead(Col_1, n = 2),
  Count == 3 ~ Col_1 - lead(Col_1, n = 3),
  Count == 4 ~ Col_1 - lead(Col_1, n = 4),
  Count == 5 ~ Col_1 - lead(Col_1, n = 5)
))
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives the desired result, but it does not scale well: with 10 or more distinct counts, a different solution is required.
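One way to avoid listing every count is to compute the target row index directly. The following is a sketch in base R using the Data block from the question; idx is a name introduced here. It relies on the fact that indexing a numeric vector past its end returns NA.

```r
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
                           1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
                           4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
                 Count = sort(rep(1:4, 5)))

# Row i is compared with row i + Count[i]
idx <- seq_len(nrow(df)) + df$Count

# Indices beyond the last row yield NA, matching the expected output
df$Col_new <- df$Col_1 - df$Col_1[idx]
df
```

This works for any number of distinct values in Count, since no branch per value is needed.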

avoid repeated unquoting in dplyr non standard evaluation

Suppose we have the following data:
tib <- tibble::tibble(x = 1:10)
Then, suppose we want to make a function that takes a column as input and returns a tibble with several added columns such as:
library(dplyr)
generate_transformations <- function(data, column){
transform <- sym(column)
data %>%
mutate(
sqrt = sqrt(!!transform),
recip = 1 / !!transform,
log = log(!!transform)
)
}
# Usage is great:
tib %>%
generate_transformations('x')
# A tibble: 10 x 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Now my question is, is there a way to avoid unquoting (!!) transform repeatedly?
Yes, I could, e.g., temporarily rename column and then rename it back after I am done, but that is not my interest in this question.
I am interested if there is a way to produce a variable that does not need the !!.
While it does not work, I was looking for something like:
generate_transformations <- function(data, column){
transform <- !!sym(column) # cannot unquote here :(
data %>%
mutate(
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform)
)
}
Convert the column name to a string, extract that column from the data once, and then use the resulting vector (transform) directly:
generate_transformations <- function(data, column){
transform <- data[[rlang::as_string(ensym(column))]]
data %>%
mutate(
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform)
)
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Or create a temporary column and remove it later
generate_transformations <- function(data, column){
data %>%
mutate(transform = !! rlang::ensym(column),
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform),
transform = NULL
)
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
You can do it in one step if you swap !! for {{}} and use across:
data_transformations <- function(d, col, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{col}}, .fns=funs))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x x_sqrt x_log x_recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To restore your original column names, use
data_transformations <- function(d, col, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{col}}, .fns=funs, .names="{.fn}"))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x sqrt log recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To handle multiple columns:
data_transformations <- function(d, cols, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{cols}}, .fns=funs))
}
d1 <- tibble(x=1:10, y=seq(2, 20, 2))
d1 %>% data_transformations(c(x, y), list(sqrt=sqrt, log=log))
# A tibble: 10 × 6
x y x_sqrt x_log y_sqrt y_log
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 1.41 0.693
2 2 4 1.41 0.693 2 1.39
3 3 6 1.73 1.10 2.45 1.79
4 4 8 2 1.39 2.83 2.08
5 5 10 2.24 1.61 3.16 2.30
6 6 12 2.45 1.79 3.46 2.48
7 7 14 2.65 1.95 3.74 2.64
8 8 16 2.83 2.08 4 2.77
9 9 18 3 2.20 4.24 2.89
10 10 20 3.16 2.30 4.47 3.00
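If the column name arrives as a string, another option is the .data pronoun, which dplyr provides for use inside data-masking verbs such as mutate. The sketch below keeps the question's function name and arguments. It does not remove the repetition, but it needs neither sym() nor !!.

```r
library(dplyr)

generate_transformations <- function(data, column) {
  # .data[[column]] looks up the column whose name is stored in `column`
  data %>%
    mutate(
      sqrt  = sqrt(.data[[column]]),
      recip = 1 / .data[[column]],
      log   = log(.data[[column]])
    )
}

tib <- tibble::tibble(x = 1:10)
res <- generate_transformations(tib, "x")
res
```

The across() version above is still more compact when the list of functions should be configurable.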

Achieving the equivalent of rbind using tidyr [duplicate]

This question already has answers here:
tidyverse pivot_longer several sets of columns, but avoid intermediate mutate_wider steps [duplicate]
(3 answers)
Closed 2 years ago.
I have some data that looks like this.
set.seed(1)
df <- data.frame(group = rep(letters[1:2],each=3),
day = rep(1:3,2),
var1_mean = round(rnorm(6),2),
var1_sd = round(rnorm(6,5),2),
var2_mean = round(rnorm(6),2),
var2_sd = round(rnorm(6,5),2))
df
# output
# group day var1_mean var1_sd var2_mean var2_sd
# a 1 -0.63 5.49 -0.62 5.82
# a 2 0.18 5.74 -2.21 5.59
# a 3 -0.84 5.58 1.12 5.92
# b 1 1.60 4.69 -0.04 5.78
# b 2 0.33 6.51 -0.02 5.07
# b 3 -0.82 5.39 0.94 3.01
Now here is what I would like it to look like (and the code I used to get there)
library(tidyverse)
rbind(df %>% select(group, day, starts_with("var1")) %>% rename(mean = var1_mean, sd = var1_sd),
df %>% select(group, day, starts_with("var2")) %>% rename(mean = var2_mean, sd = var2_sd)) %>%
add_column(var = rep(paste0("var",1:2),each=6), .before = "group")
# output
# var group day mean sd
# var1 a 1 -0.63 5.49
# var1 a 2 0.18 5.74
# var1 a 3 -0.84 5.58
# var1 b 1 1.60 4.69
# var1 b 2 0.33 6.51
# var1 b 3 -0.82 5.39
# var2 a 1 -0.62 5.82
# var2 a 2 -2.21 5.59
# var2 a 3 1.12 5.92
# var2 b 1 -0.04 5.78
# var2 b 2 -0.02 5.07
# var2 b 3 0.94 3.01
Now my code obviously gets the job done but I was wondering if there is some way to use pivot_longer() or some other function to do it less clunkily.
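We can use pivot_longer with names_sep = "_" to split each column name at the underscore, and names_to = c('grp', '.value') so that the first part goes into a grouping column while the second part (".value") supplies the names of the output columns (mean and sd).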
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'),
names_to = c('grp', '.value'), names_sep="_")
# group day grp mean sd
# <chr> <int> <chr> <dbl> <dbl>
# 1 a 1 var1 -0.63 5.49
# 2 a 1 var2 -0.62 5.82
# 3 a 2 var1 0.18 5.74
# 4 a 2 var2 -2.21 5.59
# 5 a 3 var1 -0.84 5.58
# 6 a 3 var2 1.12 5.92
# 7 b 1 var1 1.6 4.69
# 8 b 1 var2 -0.04 5.78
# 9 b 2 var1 0.33 6.51
#10 b 2 var2 -0.02 5.07
#11 b 3 var1 -0.82 5.39
#12 b 3 var2 0.94 3.01
We could remove the 'grp' column afterwards if it is not needed:
df %>%
pivot_longer(cols = starts_with('var'),
names_to = c('grp', '.value'), names_sep="_") %>%
select(-grp)
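To reproduce the layout of the rbind version in the question (identifier column first, all var1 rows before var2), the pivoted result can be sorted and reordered. This is a sketch that assumes the grouping column should be called var, as in the question's expected output; long is a name chosen here.

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(group = rep(letters[1:2], each = 3),
                 day = rep(1:3, 2),
                 var1_mean = round(rnorm(6), 2),
                 var1_sd = round(rnorm(6, 5), 2),
                 var2_mean = round(rnorm(6), 2),
                 var2_sd = round(rnorm(6, 5), 2))

long <- df %>%
  pivot_longer(cols = starts_with("var"),
               names_to = c("var", ".value"),
               names_sep = "_") %>%
  arrange(var, group, day) %>%  # var1 block first, then var2
  relocate(var)                 # move var to the first column
long
```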

Provide tibble names in purrr

I would like to know if it is possible to provide column names in the as_tibble function. I know that I could use the rename function to change column names, but I would like to reduce the number of lines I write. Let's say I want my column names to be a1, a2, a3.
> library(purrr)
> library(tidyverse)
> 1:3 %>%
+ map(~ rnorm(104, .x)) %>%
+ map_dfc(~as_tibble(.x))
# A tibble: 104 x 3
value value1 value2
<dbl> <dbl> <dbl>
1 2.91139409 1.44646163 1.298360
2 0.87725704 4.05341889 3.892296
3 0.73230088 2.72506579 3.520865
4 1.02862344 2.09576397 4.009980
5 0.49159059 -1.23746772 3.172201
6 0.24665840 1.80876495 2.927716
7 0.75112051 2.22486452 2.896452
8 -0.06036349 3.63503054 3.218324
9 1.84431314 1.88562406 2.398761
10 0.70866474 0.08947359 3.954770
# ... with 94 more rows
We can call as_tibble inside map_dfc, and then use setNames(paste0("a", seq_len(ncol(.)))) to rename the columns based on the number of columns.
library(tidyverse)
set.seed(123)
1:3 %>%
map_dfc(~as_tibble(rnorm(104, .x))) %>%
setNames(paste0("a", seq_len(ncol(.))))
# A tibble: 104 x 3
a1 a2 a3
<dbl> <dbl> <dbl>
1 0.440 1.05 4.65
2 0.770 1.95 2.95
3 2.56 1.22 3.12
4 1.07 0.332 3.24
5 1.13 1.62 4.23
6 2.72 2.92 2.48
7 1.46 1.42 2.01
8 -0.265 2.61 4.68
9 0.313 0.382 2.56
10 0.554 1.94 2.28
# ... with 94 more rows
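An alternative sketch is to name the input before mapping, so the names carry through to the columns and no renaming is needed afterwards. It uses purrr::set_names and tibble::as_tibble on a named list; res is a name introduced here, and the values will differ from the output above because the random draws happen in a different way.

```r
library(purrr)
library(tibble)

set.seed(123)
res <- 1:3 %>%
  set_names(paste0("a", 1:3)) %>%   # names a1, a2, a3 on the input
  map(~ rnorm(104, mean = .x)) %>%  # map() keeps the names of its input
  as_tibble()                       # named list of equal-length vectors
res
```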
