What is the best way to tidily append many columns? - r

I'm looking to append 30 columns which give values for gamma distributions by using the tidyverse. Here's an example of the data:
df <- data.frame('rank'=1:3,'shape'=c(16,0.2,4),'rate'=c(13,0.4,0.2))
I'd like to use dgamma(1:30,shape,rate) to append 30 columns to the existing dataframe.

You can use map2() in purrr and unnest_wider() in tidyr.
library(tidyverse)
df %>%
mutate(density = map2(shape, rate, dgamma, x = 1:30)) %>%
unnest_wider(density, names_sep = "_")
Or use rowwise() at first and then mutate() with list().
df %>%
rowwise() %>%
mutate(density = list(dgamma(1:30, shape, rate))) %>%
unnest_wider(density, names_sep = "_")
Both of them give
# # A tibble: 3 x 33
# rank shape rate density_1 density_2 density_3 density_4 density_5 density_6 density_7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 16 13 1.15 0.0852 0.0000843 1.43e-8 9.16e-13 3.19e-17 7.28e-22
# 2 2 0.2 0.4 0.122 0.0468 0.0227 1.21e-2 6.77e- 3 3.92e- 3 2.32e- 3
# 3 3 4 0.2 0.000218 0.00143 0.00395 7.67e-3 1.23e- 2 1.73e- 2 2.26e- 2
# # … with 23 more variables: density_8 <dbl>, density_9 <dbl>, density_10 <dbl>, ..., density_30 <dbl>
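For comparison, the same 30 columns can be built without the tidyverse. This is only a minimal base R sketch, assuming the example data is stored in df; the density_ prefix is chosen here just to mirror the output above.

```r
# Example data from the question, assumed to be stored as `df`
df <- data.frame(rank = 1:3, shape = c(16, 0.2, 4), rate = c(13, 0.4, 0.2))

# mapply() calls dgamma(x = 1:30, shape, rate) once per row of df and
# returns a 30 x 3 matrix (one column per row of df), so transpose it
dens <- t(mapply(dgamma, shape = df$shape, rate = df$rate,
                 MoreArgs = list(x = 1:30)))
colnames(dens) <- paste0("density_", 1:30)

# Bind the 30 density columns to the original data frame
res <- cbind(df, dens)
dim(res)
```

The result is a plain data.frame rather than a tibble, which may or may not matter for later steps.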

Related

Reordering columns using common names - dplyr

My data comes from a database which, depending on when I run my SQL query could contain different values for POS from one week to the other.
Not knowing which values will be in a variable makes it very hard to automate the creation of a report.
My data looks as follows:
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
I need to pivot this data frame wider so that there's a column for each pos by cost (gross & net).
This can easily be achieved using pivot_wider:
x <- sample %>% pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST))
Objective
I would like to be able to keep the columns for each POS together i.e. the GROSS_COST_Hospital and NET_COST_Hospital would be side by side, similar for all other POS columns.
Is there an elegant way to group columns using string matching?
Unfortunately, I don't think there is a direct solution to this (yet!). See https://github.com/tidyverse/tidyr/issues/839 .
For now you can get the data in long format so you can control their ordering the way you want.
library(tidyr)
sample %>%
pivot_longer(cols = c(GROSS_COST, NET_COST)) %>%
pivot_wider(names_from = c(name, POS), values_from = value)
# DRUG GROSS_COST_Hosp… NET_COST_Hospit… GROSS_COST_Phys… NET_COST_Physic…
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 50 45 100 80
#2 B NA NA NA NA
# … with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
We can reorder the columns in the select step
library(dplyr)
library(tidyr)
library(stringr)
sample %>%
pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST)) %>%
select(DRUG, names(.)[-1][order(str_extract(names(.)[-1], '[^_]+$'))])
# A tibble: 2 x 7
# DRUG GROSS_COST_Home NET_COST_Home GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physician NET_COST_Physician
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A NA NA 50 45 100 80
#2 B 60 40 NA NA NA NA
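To see why that select() call groups the columns by POS, the ordering step can be run on its own. The sketch below uses base R's sub() in place of str_extract() and assumes, as the regular expression above does, that the POS values themselves contain no underscores.

```r
# Column names as pivot_wider() produces them for the example data
cols <- c("GROSS_COST_Hospital", "GROSS_COST_Physician", "GROSS_COST_Home",
          "NET_COST_Hospital", "NET_COST_Physician", "NET_COST_Home")

# Keep only the text after the last underscore (the POS value)
suffix <- sub(".*_", "", cols)

# order() leaves ties in their original order, so within each POS
# the GROSS_COST column stays ahead of the NET_COST column
reordered <- cols[order(suffix)]
reordered
```

Because the sort key is the POS value, new POS values arriving from the database are grouped automatically, in alphabetical order.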
A data.table option using dcast + melt
> dcast(melt(setDT(sample), id.vars = c("DRUG", "POS")), DRUG ~ variable + POS)
DRUG GROSS_COST_Home GROSS_COST_Hospital GROSS_COST_Physician NET_COST_Home
1: A NA 50 100 NA
2: B 60 NA NA 40
NET_COST_Hospital NET_COST_Physician
1: 45 80
2: NA NA
As of tidyr 1.2.0 the issue is finally resolved: you can do this directly using the names_vary argument
library(tidyr)
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
sample %>%
pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST), names_vary = 'slowest')
#> # A tibble: 2 x 7
#> DRUG GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physi~ NET_COST_Physic~
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 50 45 100 80
#> 2 B NA NA NA NA
#> # ... with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)
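The effect of names_vary is easiest to see by comparing the column order of both settings. This is a small sketch that assumes tidyr 1.2.0 or later is installed; the data frame is called sample_df here (a name chosen for this example) to avoid masking base::sample.

```r
library(tidyr)  # names_vary needs tidyr >= 1.2.0

sample_df <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

# Default ("fastest"): all GROSS_COST columns first, then all NET_COST columns
fastest <- pivot_wider(sample_df, names_from = POS,
                       values_from = c(GROSS_COST, NET_COST))

# "slowest": the columns for each POS value stay next to each other
slowest <- pivot_wider(sample_df, names_from = POS,
                       values_from = c(GROSS_COST, NET_COST),
                       names_vary = "slowest")

names(fastest)
names(slowest)
```

Note that the POS groups come out in order of first appearance in the data, not alphabetically.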

Reshape long data table to list of wide data tables

My question is an expansion of the question posed in "How to reshape data from long to wide format", so I will phrase it in a similar way.
The difference is that I want to rearrange one long data table into a list of wide data tables.
dat <- data.table(
sim = rep(c(1,2), each=4),
time = rep(1:4, 2),
value1 = rnorm(8),
value2 = rnorm(8)
)
dat
sim time value1 value2
1 1 1 0.3407 0.5167
2 1 2 -0.7033 0.8416
3 1 3 -0.3795 -0.4717
4 1 4 -0.7460 0.8479
5 2 1 0.8981 -0.7163
6 2 2 -0.3347 -0.6849
7 2 3 0.5013 0.8941
8 2 4 -0.1745 0.0795
I want to reshape it so that I have a list of wide data tables named value1, value2 ... value99 etc...
l = list()
l[["value1"]]
sim 1 2 3 4
1 1 0.3407 -0.7033 -0.3795 -0.7460
5 2 0.8981 -0.3347 0.5013 -0.1745
l[["value2"]]
sim 1 2 3 4
1 1 0.5167 0.8416 -0.4717 0.8479
5 2 -0.7163 -0.6849 0.8941 0.0795
Two variants.
data.table
library(data.table)
tmp <- dcast(melt(as.data.table(dat), id = c("sim", "time")), sim + variable ~ time)
tmp <- split(tmp, tmp$variable)
tmp <- lapply(tmp, set, i = NULL, j = "variable", value = NULL)
tmp
# $value1
# sim 1 2 3 4
# <num> <num> <num> <num> <num>
# 1: 1 1.0458737762 -0.4845954 0.1891288 0.05100633
# 2: 2 -0.0002406689 1.8093820 -0.8253280 1.14547045
# $value2
# sim 1 2 3 4
# <num> <num> <num> <num> <num>
# 1: 1 0.03157319 -0.8352058 -0.06876365 0.7467717
# 2: 2 -0.42551873 -0.7720822 0.15276411 0.9885968
I often use magrittr::%>% with data.table as well, so that can be converted into
library(data.table)
library(magrittr) # if %>% is not already available
as.data.table(dat) %>%
melt(., id = c("sim", "time")) %>%
dcast(., sim + variable ~ time) %>%
split(., .$variable) %>%
lapply(., set, i = NULL, j = "variable", value = NULL)
# $value1
# sim 1 2 3 4
# <num> <num> <num> <num> <num>
# 1: 1 1.0458737762 -0.4845954 0.1891288 0.05100633
# 2: 2 -0.0002406689 1.8093820 -0.8253280 1.14547045
# $value2
# sim 1 2 3 4
# <num> <num> <num> <num> <num>
# 1: 1 0.03157319 -0.8352058 -0.06876365 0.7467717
# 2: 2 -0.42551873 -0.7720822 0.15276411 0.9885968
tidyverse
library(dplyr)
library(tidyr) # pivot_longer, pivot_wider
dat %>%
pivot_longer(., -c(sim, time)) %>%
pivot_wider(., names_from = time, values_from = value) %>%
split(., .$name) %>%
lapply(., select, -name)
# $value1
# # A tibble: 2 x 5
# sim `1` `2` `3` `4`
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1.05 -0.485 0.189 0.0510
# 2 2 -0.000241 1.81 -0.825 1.15
# $value2
# # A tibble: 2 x 5
# sim `1` `2` `3` `4`
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0316 -0.835 -0.0688 0.747
# 2 2 -0.426 -0.772 0.153 0.989
My solution would be to create a nested data frame of the results. Below is a brief description of the method, followed by a reprex.
I would use pivot_longer() and pivot_wider() to reshape the data. pivot_longer() is used first so that each row contains a single value, labelled with its time, its simulation, and whether it comes from value1 or value2. pivot_wider() then spreads the times into columns, so each row holds the values at every time for one simulation and one set of values (value1 or value2).
Finally, nest() stores all the data for each set of values in its own data frame. If necessary, these can be accessed as a list of data frames via nested_vals$data, where nested_vals is the object the nested data frame was assigned to.
library(tidyverse)
#Setup data
dat <- data.frame(
sim = rep(c(1,2), each=4),
time = rep(1:4, 2),
value1 = rnorm(8),
value2 = rnorm(8)
)
# Construct nested dataframe
nested_vals <- dat %>%
# Format dataset in tidy format
pivot_longer(cols = c(value1, value2)) %>%
# Move the name of the data to the beginning of the dataframe
relocate(name) %>%
# Pivot to matrix form as requested (i.e. times as columns, sims as rows)
pivot_wider(id_cols = c(name, sim), names_from = time, values_from = value) %>%
# Nest results by name
nest(data = -name)
nested_vals
#> # A tibble: 2 x 2
#> name data
#> <chr> <list>
#> 1 value1 <tibble[,5] [2 x 5]>
#> 2 value2 <tibble[,5] [2 x 5]>
nested_vals$data[[2]]
#> # A tibble: 2 x 5
#> sim `1` `2` `3` `4`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.0639 0.250 -1.28 0.850
#> 2 2 -1.90 0.000421 0.704 -0.164
Created on 2021-04-07 by the reprex package (v2.0.0)
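If the named list from the question is needed rather than a nested data frame, the list column can be named after the name column. A minimal sketch, assuming the same pipeline as above; the seed and the object name l are arbitrary choices for this example.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dat <- data.frame(
  sim = rep(c(1, 2), each = 4),
  time = rep(1:4, 2),
  value1 = rnorm(8),
  value2 = rnorm(8)
)

nested_vals <- dat %>%
  pivot_longer(cols = c(value1, value2)) %>%
  pivot_wider(id_cols = c(name, sim), names_from = time, values_from = value) %>%
  nest(data = -name)

# Name each nested tibble after the value column it came from
l <- setNames(as.list(nested_vals$data), nested_vals$name)
names(l)
```

After this, l[["value1"]] and l[["value2"]] are the wide tables asked for in the question.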
One more way, with a single pipe syntax
library(tidyverse)
dat %>% pivot_longer(c(value1, value2)) %>%
group_split(name) %>% setNames(map(., ~.x[[3]][1])) %>%
map(~ .x %>% pivot_wider(id_cols = sim, names_from = time, values_from = value))
$value1
# A tibble: 2 x 5
sim `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.851 -0.0484 -0.656 -0.121
2 2 -0.645 1.59 -0.274 0.445
$value2
# A tibble: 2 x 5
sim `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.46 -1.62 -0.672 1.43
2 2 1.65 0.790 0.495 0.162
Another approach:
library(dplyr)
library(tidyr)
wide_dat <- dat %>% pivot_wider(id_cols = sim, names_from = time, values_from = starts_with('value'))
lapply(lapply(split.default(wide_dat[-1], sub('_\\d','',names(wide_dat[-1]))), function(x) cbind(wide_dat[1],x)), setNames, c('sim', 1:4))
$value1
sim 1 2 3 4
1 1 -0.1704969 0.2820143 1.181898 2.2377396
2 2 2.1920534 0.8214070 0.421177 0.7601796
$value2
sim 1 2 3 4
1 1 0.1760887 0.3440053 -0.8435849 0.6729751
2 2 -0.1714095 1.5125986 -0.5739871 -0.9648294
A tidyverse solution could be:
library(dplyr)
library(purrr)
library(tidyr)
dat_longer <- dat %>%
tidyr::pivot_longer(starts_with("value"), names_to="col_name", values_to="values")
list_wide <- purrr::map(unique(dat_longer[["col_name"]]),
~dat_longer %>%
dplyr::filter(col_name==.x) %>%
tidyr::pivot_wider(values_from = "values", names_from="time") %>%
select(-col_name)) %>%
purrr::set_names(unique(dat_longer[["col_name"]]))
$value1
# A tibble: 2 x 5
sim `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.710 -0.334 -0.370 0.777
2 2 0.130 0.877 1.24 -0.202
$value2
# A tibble: 2 x 5
sim `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.719 -0.909 0.0821 -0.158
2 2 -0.706 1.51 0.234 1.09
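For completeness, the same list can be produced with base R only. This is a sketch built on stats::reshape(), with the numbers from the question typed in so the result can be checked by eye; wide_list and value_cols are names chosen for this example.

```r
# Data from the question, typed in rather than drawn with rnorm()
dat <- data.frame(
  sim = rep(c(1, 2), each = 4),
  time = rep(1:4, 2),
  value1 = c(0.3407, -0.7033, -0.3795, -0.7460, 0.8981, -0.3347, 0.5013, -0.1745),
  value2 = c(0.5167, 0.8416, -0.4717, 0.8479, -0.7163, -0.6849, 0.8941, 0.0795)
)

value_cols <- grep("^value", names(dat), value = TRUE)

# One wide data frame per value column, collected in a named list
wide_list <- lapply(setNames(value_cols, value_cols), function(v) {
  wide <- reshape(dat[c("sim", "time", v)], idvar = "sim",
                  timevar = "time", direction = "wide")
  # reshape() names the new columns "<v>.<time>"; keep only the time part
  names(wide) <- sub(paste0("^", v, "\\."), "", names(wide))
  rownames(wide) <- NULL
  wide
})

wide_list$value1
```

Because value_cols is found by pattern, the same code should cover value1 through value99 without changes.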

conditionally mutating column values using `dplyr`

I am using WRS2 to carry out robust pairwise comparisons. But one problem is that it removes the group level names from the output dataframes and saves it in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final dataframe should look something like the following-
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Use a named vector as a lookup table with tidyverse. This matches by value rather than by position, so it still works if the values in the 'Group' columns are not a 1..n sequence or are stored as character.
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
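The lookup itself is plain base R and can be tried without WRS2. In the sketch below the vectors group, lookup2 and group2 are made up for illustration; only fnames comes from the question.

```r
fnames <- c("placebo", "low", "high")

# Lookup table: names are the group codes (as strings), values are the labels
lookup <- setNames(fnames, seq_along(fnames))

group <- c(1, 1, 2)  # codes as they appear in the first Group column
labels <- unname(lookup[as.character(group)])
labels

# Matching by name also works when the codes are not 1, 2, 3, ...
lookup2 <- c("10" = "placebo", "20" = "low")  # hypothetical codes
group2 <- c(20, 10)
labels2 <- unname(lookup2[as.character(group2)])
labels2
```

Positional indexing such as fnames[group] gives the same result only when the codes happen to be 1..n.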
Does this fulfil your needs?
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach: reshape the data to long format after extracting the objects as tibbles, then use left_join() to attach the group labels. Here is the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high

left_join on a nested list

I have a nested df x and an unnested df y.
How can I join these two together so that the final output is a single row with the id and val columns from x and a new column for each of the respective num values in order of appearance, num_1, num_2 ...?
library(tidyverse)
x <- tibble(id = list(letters[1:6]), val = 13)
x
#> # A tibble: 1 x 2
#> id val
#> <list> <dbl>
#> 1 <chr [6]> 13
y <- tibble(id = letters[1:6], num = rnorm(6))
y
#> # A tibble: 6 x 2
#> id num
#> <chr> <dbl>
#> 1 a 0.532
#> 2 b -0.106
#> 3 c -0.105
#> 4 d 0.973
#> 5 e -0.825
#> 6 f -0.951
map2(x, y, left_join, by = 'id')
Error in UseMethod("left_join"): no applicable method for 'left_join' applied to an object of class "list"
Created on 2020-08-14 by the reprex package (v0.3.0)
Edit: I'm looking for something loosely like this while still maintaining the ID column.
x %>%
unnest(id) %>%
left_join(y) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = -id,
values_from = num,
names_from = n)
#> Joining, by = "id"
#> # A tibble: 1 x 7
#> val `1` `2` `3` `4` `5` `6`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 13 1.28 -0.387 -0.438 -0.0826 0.164 -1.24
Continuing with your work, you can try the following.
library(dplyr)
library(tidyr)
x %>%
unnest(id) %>%
left_join(y, by = "id") %>%
mutate(name = row_number(), id = list(id)) %>%
pivot_wider(values_from = num, names_glue = "num_{name}")
# # A tibble: 1 x 8
# id val num_1 num_2 num_3 num_4 num_5 num_6
# <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 <chr [6]> 13 1.28 -0.387 -0.438 -0.0826 0.164 -1.24
or
x %>%
mutate(num = map(id, ~ tibble::deframe(y) %>% .[match(.x, names(.))] %>% unname)) %>%
unnest_wider(num, names_sep = "_")
# # A tibble: 1 x 8
# id val num_1 num_2 num_3 num_4 num_5 num_6
# <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 <chr [6]> 13 1.28 -0.387 -0.438 -0.0826 0.164 -1.24
Description of the second solution
deframe() in tibble transforms a two-column data.frame into a named vector: the first column becomes the names and the second becomes the values. deframe(y) %>% .[match(.x, names(.))] is equivalent to deframe(y)[match(.x, names(deframe(y)))]. Because deframe(y) appears twice, I move it to the front of a pipe and use the . symbol to refer to it after the pipe. This line looks up the position of each id from x among the ids of y and reorders the num column of y accordingly.
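The reordering step can be checked on a small example in which the two id orders differ. This sketch uses base R's setNames() to build the same kind of named vector that deframe() returns; the data and the vector wanted are made up for illustration.

```r
y <- data.frame(id = c("a", "b", "c"), num = c(10, 20, 30))

# Named vector of num values keyed by id; this mirrors what
# tibble::deframe(y) returns for a two-column data frame
lookup <- setNames(y$num, y$id)

wanted <- c("c", "a", "b")  # hypothetical id order taken from x

# Position of each wanted id among the names, then subset in that order
by_match <- unname(lookup[match(wanted, names(lookup))])

# Indexing by name gives the same result
by_name <- unname(lookup[wanted])
by_match
```

When both data frames list the ids in the same order, as in the question, the reordering leaves the values where they are.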
Based on your y you're not going to have multiple columns but adjusting the example a little, is this what you were aiming for?
x <- tibble(id = list(letters[1:6]), val = 13)
y <- tibble(id = rep(letters[1:6],2), num = rnorm(12),
name = paste0("num_", rep(1:2, each = 6)))
map_dfr(x$id[[1]], ~tibble(id = .x, val = x$val)) %>%
left_join(
pivot_wider(y, names_from = name, values_from = num)
)
#> Joining, by = "id"
#> # A tibble: 6 x 4
#> id val num_1 num_2
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 13 0.609 1.97
#> 2 b 13 0.956 -1.84
#> 3 c 13 0.425 0.297
#> 4 d 13 0.0379 -0.784
#> 5 e 13 -0.532 -0.769
#> 6 f 13 0.538 -1.10

weighted mean in dplyr for multiple columns

I'm trying to calculate the weighted mean for multiple columns using dplyr. At the moment I'm stuck with summarize_each, which seems to be part of the solution. Here's some example code:
library(dplyr)
f2a <- c(1,0,0,1)
f2b <- c(0,0,0,1)
f2c <- c(1,1,1,1)
clustervar <- c("A","B","B","A")
weight <- c(10,20,30,40)
df <- data.frame (f2a, f2b, f2c, clustervar, weight, stringsAsFactors=FALSE)
df
What I am looking for is something like
df %>%
group_by (clustervar) %>%
summarise_each(funs(weighted.mean(weight)), select=cbind(clustervar, f2a:f2c))
The result of this is only:
# A tibble: 2 × 4
clustervar select4 select5 select6
<chr> <dbl> <dbl> <dbl>
1 A 25 25 25
2 B 25 25 25
What am I missing here?
You can use summarise_at to specify which columns you want to operate on:
df %>% group_by(clustervar) %>%
summarise_at(vars(starts_with('f2')),
funs(weighted.mean(., weight)))
#> # A tibble: 2 × 4
#> clustervar f2a f2b f2c
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 0.8 1
#> 2 B 0 0.0 1
We can reshape it to 'long' format and then do this
library(tidyverse)
gather(df, Var, Val, f2a:f2c) %>%
group_by(clustervar, Var) %>%
summarise(wt =weighted.mean(Val, weight)) %>%
spread(Var, wt)
Or another option is
df %>%
group_by(clustervar) %>%
summarise_each(funs(weighted.mean(., weight)), matches("^f"))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 0.8 1
# 2 B 0 0.0 1
Or with summarise_at and matches (another variation of another post - didn't see the other post while posting)
df %>%
group_by(clustervar) %>%
summarise_at(vars(matches('f2')), funs(weighted.mean(., weight)))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
Or another option is data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, weight)),
by = clustervar, .SDcols = f2a:f2c]
# clustervar f2a f2b f2c
#1: A 1 0.8 1
#2: B 0 0.0 1
NOTE: All four answers are based on legitimate tidyverse/data.table syntax and would get the expected output
We can also create a function that uses syntax from the devel version of dplyr (soon to be released as 0.6.0). enquo does a job similar to substitute: it takes the input arguments and converts them to quosures. Within group_by/summarise/mutate, we evaluate the quosure by unquoting it (UQ or !!)
wtFun <- function(dat, pat, wtcol, grpcol){
wtcol <- enquo(wtcol)
grpcol <- enquo(grpcol)
dat %>%
group_by(!!grpcol) %>%
summarise_at(vars(matches(pat)), funs(weighted.mean(., !!wtcol)))
}
wtFun(df, "f2", weight, clustervar)
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
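In current dplyr releases funs() is deprecated and the scoped verbs such as summarise_at() are superseded by across(). A minimal sketch of the same calculation in that style, assuming dplyr 1.0.0 or later; res is a name chosen for this example.

```r
library(dplyr)  # across() needs dplyr >= 1.0.0

df <- data.frame(
  f2a = c(1, 0, 0, 1),
  f2b = c(0, 0, 0, 1),
  f2c = c(1, 1, 1, 1),
  clustervar = c("A", "B", "B", "A"),
  weight = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Inside the lambda, .x is the current f2 column and weight is the
# weight column, both restricted to the rows of the current group
res <- df %>%
  group_by(clustervar) %>%
  summarise(across(starts_with("f2"), ~ weighted.mean(.x, weight)))
res
```

The weight column is deliberately left out of the across() selection, so it is available unchanged for every f2 column.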
