How to mutate multiple columns as a function of multiple columns systematically? - r

I have a tibble with a number of variables collected over time. A very simplified version of the tibble looks like this.
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
I want to systematically create a new set of variables varC so that varC.t# = varA.t# / varB.t# where # is 1, 2, 3, etc. (similar to the way the column names are set up in the tibble above).
How do I use something along the lines of mutate or across to do this?

You can do this with mutate(across(...)), although there is probably a shorter way to rename the new columns.
df %>%
mutate(across(.cols = c(varA.t1, varA.t2),
.fns = ~ .x / get(glue::glue(str_replace(cur_column(), "varA", "varB"))),
.names = "V_{.col}")) %>%
rename_with(~str_replace(., "V_varA", "varC"), starts_with("V_"))
# A tibble: 2 x 7
id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 row_1 5 10 2 4 2.5 2.5
2 row_2 20 50 4 6 5 8.33
If there is a long time series you can also create a vector for .cols beforehand.
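As a minimal sketch of that idea, the numerator columns can be collected with a pattern match and passed to across() through all_of(). The name a_cols is made up for illustration, and the sketch assumes every varA.t# column has a matching varB.t# column. It keeps the get() lookup from the answer above and, as one possible shortcut for the renaming, computes the output name inside the .names glue specification instead of renaming afterwards.

```r
library(dplyr)

df <- tribble(
  ~id,     ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
  "row_1",        5,       10,        2,        4,
  "row_2",       20,       50,        4,        6
)

# Build the vector of numerator columns once (a_cols is a hypothetical name);
# this scales to any number of time points.
a_cols <- grep("^varA\\.", names(df), value = TRUE)

res <- df %>%
  mutate(across(
    .cols  = all_of(a_cols),
    # cur_column() is the name of the varA column being processed;
    # swapping the prefix gives the name of the matching varB column.
    .fns   = ~ .x / get(sub("varA", "varB", cur_column())),
    # The output name is computed in the glue spec, so no rename step is needed.
    .names = "{sub('varA', 'varC', .col)}"
  ))

res
```

The result should contain the original five columns followed by varC.t1 and varC.t2.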

I have a package on GitHub called {dplyover} which aims to solve this kind of problem in way similar to dplyr::across.
The function is called across2. It lets you define two sets of columns to which you can apply one or several functions. The .names argument supports two glue specifications: {pre} and {suf}. They extract the shared prefix and suffix of the variable names. This makes it easy to put nice names on our output variables.
The function has one caveat. It is not performant when applied to highly grouped data (there is a vignette with benchmarks).
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
df %>%
mutate(across2(starts_with("varA"),
starts_with("varB"),
~ .x / .y,
.names = "{pre}C.{suf}"))
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v0.3.0)

For such cases I find using base R easy and efficient.
varAcols <- sort(grep('varA', names(df), value = TRUE))
varBcols <- sort(grep('varB', names(df), value = TRUE))
df[sub('A', 'C', varAcols)] <- df[varAcols]/df[varBcols]
# id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 row_1 5 10 2 4 2.5 2.5
#2 row_2 20 50 4 6 5 8.33
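The approach above relies on both sorted name vectors lining up position by position. A slightly more defensive sketch, assuming the same varA.t# / varB.t# naming scheme, derives the denominator names from the numerator names and checks that they exist before dividing:

```r
df <- data.frame(
  id      = c("row_1", "row_2"),
  varA.t1 = c(5, 20),
  varA.t2 = c(10, 50),
  varB.t1 = c(2, 4),
  varB.t2 = c(4, 6)
)

# Numerator columns, matched on the prefix only.
varAcols <- grep("^varA\\.", names(df), value = TRUE)

# Derive the partner and output names from the numerator names,
# so the pairing is by time suffix rather than by sort order.
varBcols <- sub("^varA", "varB", varAcols)
varCcols <- sub("^varA", "varC", varAcols)

# Fail early if a time point has no denominator column.
stopifnot(all(varBcols %in% names(df)))

df[varCcols] <- df[varAcols] / df[varBcols]
df
```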

Another way to do this with some customization is
Initial setup
library(dplyr)
library(purrr)
library(stringr)
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
# A function that takes a formula, parses it and corrects the column name
operation_function <- function(df, formula) {
# Extract the column name from the formula
new_column_name <- str_extract(formula, "^.+=")
new_column_name <- trimws(gsub("=", "", new_column_name))
# Process the df
df %>%
# parse the formula - this results in a new column named after the whole formula
mutate(!!rlang::parse_expr(formula)) %>%
# rename the new created column with the correct column name
rename(!!new_column_name := last_col())
}
Note: I think there should be a more efficient way to implement a formula that carries the proper column name, though I couldn't figure it out right now. Ideas from others are welcome.
Prepare the formulas to be processed against the data. In this case it is simple.
For more complicated formulas you may want to do it a little differently.
# Prepare the formula
base_formula <- c("varC.t# = varA.t# / varB.t#")
replacement_list <- c(1, 2)
list_formula <- map(replacement_list, .f = gsub,
pattern = "#", x = base_formula)
list_formula
#> [[1]]
#> [1] "varC.t1 = varA.t1 / varB.t1"
#>
#> [[2]]
#> [1] "varC.t2 = varA.t2 / varB.t2"
Finally process the data with the list of formulas
# process with the function and then reduce them with left_join
reduce(map(.x = list_formula, .f = operation_function, df = df),
left_join)
#> Joining, by = c("id", "varA.t1", "varA.t2", "varB.t1", "varB.t2")
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v1.0.0)
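One possible answer to the note above about giving the parsed formula a proper name: parse only the right-hand side of each formula, keep the target names as the names of a list, and splice that list into a single mutate() call with !!!. This is a sketch rather than a drop-in replacement for operation_function; the names time_points, rhs_text and new_cols are made up for illustration.

```r
library(dplyr)
library(rlang)

df <- tribble(
  ~id,     ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
  "row_1",        5,       10,        2,        4,
  "row_2",       20,       50,        4,        6
)

time_points <- c(1, 2)

# Right-hand sides as text, one per time point.
rhs_text <- paste0("varA.t", time_points, " / varB.t", time_points)

# Parse each string into an expression and name it after the target column.
new_cols <- setNames(
  lapply(rhs_text, parse_expr),
  paste0("varC.t", time_points)
)

# Splicing the named list creates all new columns in one mutate(),
# so neither rename() nor left_join() is needed.
res <- df %>% mutate(!!!new_cols)
res
```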

Related

Dplyr "rename_with" giving unique column name error but function creates unique outputs

I created a function that creates unique column names from the existing ones: renameCol.
If I manually create a vector of new column names using that function I can manually set those as the new column names. However, if I use that function in rename_with I get an error about unique names.
library(tidyverse)
renameCol = function(colname)
{
match = str_match_all(colname, "HealthcareProvider((TaxonomyCode|PrimaryTaxonomySwitch))_([0-9]+)")[[1]]
coltype = match[[3]]
coltype = str_remove(coltype, "(Taxonomy|PrimaryTaxonomy)")
number = match[[4]]
return(paste0(coltype, "_", number))
}
renameCol("HealthcareProviderPrimaryTaxonomySwitch_11")
#> [1] "Switch_11"
renameCol("HealthcareProviderTaxonomyCode_11")
#> [1] "Code_11"
tb = tibble(
HealthcareProviderPrimaryTaxonomySwitch_11 = 1,
HealthcareProviderTaxonomyCode_3 = 2,
HealthcareProviderPrimaryTaxonomySwitch_9 = 3,
HealthcareProviderTaxonomyCode_13 = 4
)
tb %>% rename_with(renameCol)
#> Error in `rename_with()`:
#> ! Names must be unique.
#> x These names are duplicated:
#> * "Switch_11" at locations 1, 2, 3, and 4.
new_colnames = colnames(tb) %>% sapply(renameCol, USE.NAMES = F)
new_colnames
#> [1] "Switch_11" "Code_3" "Switch_9" "Code_13"
colnames(tb) = new_colnames
tb
#> # A tibble: 1 x 4
#> Switch_11 Code_3 Switch_9 Code_13
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4
Created on 2022-06-16 by the reprex package (v2.0.1)
The answer is present in your question itself. Your function is not vectorised. It works for only one column name at a time.
library(tidyverse)
names(tb)
#[1] "HealthcareProviderPrimaryTaxonomySwitch_11"
#[2] "HealthcareProviderTaxonomyCode_3"
#[3] "HealthcareProviderPrimaryTaxonomySwitch_9"
#[4] "HealthcareProviderTaxonomyCode_13"
renameCol(names(tb))
#[1] "Switch_11"
Hence you have to use sapply to make it work for all the columns. rename_with is not a loop (like sapply) so to make it work you can do -
tb %>% rename_with(~sapply(., renameCol))
# A tibble: 1 × 4
# Switch_11 Code_3 Switch_9 Code_13
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4
Or change the function to work with more than one column name.
renameCol = function(colname)
{
match = str_match_all(colname, "HealthcareProvider((TaxonomyCode|PrimaryTaxonomySwitch))_([0-9]+)")
match_data <- do.call(rbind, match)
coltype = match_data[, 3]
coltype = str_remove(coltype, "(Taxonomy|PrimaryTaxonomy)")
number = match_data[, 4]
return(paste0(coltype, "_", number))
}
renameCol(names(tb))
#[1] "Switch_11" "Code_3" "Switch_9" "Code_13"
tb %>% rename_with(renameCol)
# A tibble: 1 × 4
# Switch_11 Code_3 Switch_9 Code_13
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4
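Since str_replace() is already vectorised over its input, the same renaming can also be sketched as a single regular-expression replacement. The function name renameColVec and the exact pattern are assumptions based on the four column names in the question, not part of the original answer.

```r
library(dplyr)
library(stringr)

tb <- tibble(
  HealthcareProviderPrimaryTaxonomySwitch_11 = 1,
  HealthcareProviderTaxonomyCode_3 = 2,
  HealthcareProviderPrimaryTaxonomySwitch_9 = 3,
  HealthcareProviderTaxonomyCode_13 = 4
)

# Hypothetical vectorised variant: keep only the type (Code or Switch)
# and the trailing number from each column name.
renameColVec <- function(colname) {
  str_replace(
    colname,
    "^HealthcareProvider(?:Primary)?Taxonomy(Code|Switch)_([0-9]+)$",
    "\\1_\\2"
  )
}

# rename_with() passes all column names at once, which this function handles.
res <- tb %>% rename_with(renameColVec)
names(res)
```

Names that do not match the pattern are returned unchanged, so the function is safe to apply to a wider table.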

change pattern of cols with same name in purrr::map_dfc R

I have a function that generates a dataframe with 2 cols (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2" and so on that appear because the column names are the same.
I would like something like (X_df1, Y_df1, X_df2, Y_df2, ...). Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the dataframe to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!
If we don't want to change the function, we can rename before binding the columns: use pmap to loop over the rows of the data and apply the function (example_function), loop over the resulting list with imap, rename all the columns of each tibble using its list index, and then use bind_cols.
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
# One pair of names per generated data frame, i.e. per row of values
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>%
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)

assigning id values from values, not names, with purrr::map_dfr

I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
## name x
## <chr> <dbl>
## 1 1 439.
## 2 2 262350.
## 3 3 198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(., .) in the pipeline:
list(2,4,6) %>%
setNames(., .) %>%
map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tibble(name=., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
map_dfr(~ tibble(name=.,
broom::tidy(lm(mpg~cyl, data=mtcars, offset=rep(., nrow(mtcars))))))
OK, I think I found this shortly before posting (so I'll answer). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2,4,6) instead of list(2,4,6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
## Ben Bolker's answer
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
#another option
seq(2, 6, 2) %>% map2_df(rerun(length(.), mtcars$mpg), ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#> x mean
#> <chr> <chr>
#> 1 2 439
#> 2 4 262350
#> 3 6 198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
This is also possible, although it would not be my first choice, since a plain map would suffice:
library(purrr)
list(2, 4, 6) %>%
pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
power x
<dbl> <dbl>
1 2 439.
2 4 262350.
3 6 198039783.
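Another variant on the self-naming idea, sketched here with purrr's set_names(): called without a second argument, it names each element after its own value, so the .id column picks up 2, 4, 6. The sketch uses map() followed by bind_rows(.id = ...) instead of map_dfr(); the variable name powers is made up for illustration.

```r
library(purrr)
library(dplyr)

powers <- c(2, 4, 6)

res <- powers %>%
  # set_names() with no names supplied uses the values themselves as names.
  set_names() %>%
  map(~ tibble(x = mean(mtcars$mpg^.x))) %>%
  # The list names become the "power" column (as character).
  bind_rows(.id = "power")

res
```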

Summarize variables beside

I am looking for a solution to my problem. So far I can only solve it by manually rearranging the columns.
Example code:
library(dplyr)
set.seed(1)
Data <- data.frame(
W = sample(1:10),
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE),
Z = sample(c("cat", "dog"), 10, replace = TRUE)
)
#
summarized <- Data %>% group_by(Z) %>% summarise_if(is.numeric,funs(mean,median),na.rm=T)
print(Data)
I want the output to look like the one below, with every function applied to the first column, then every function applied to the second column, and so on. My code does it the other way around.
Of course I could rearrange the cols but that is not what Data Science is about. I have hundreds of cols and want to apply multiple functions.
This is what I want:
summarized <- summarized[,c(1,2,4,3,5)] #best solution yet
Is there an argument I am missing? I bet there is an easy solution, or another function that does the job.
Guys, thx in advance!
One option would be to post-process with adequate select_helpers
library(dplyr)
summarized %>%
select(Z, starts_with('W'), everything())
# A tibble: 2 x 5
# Z W_mean W_median X_mean X_median
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 cat 5.25 5.5 3.75 3.5
#2 dog 5.67 5.5 6.67 7
If there are 100s of columns, one approach is to get the substring of the column names, and order
library(stringr)
summarized %>%
select(Z, order(str_remove(names(.), "_.*")))
# A tibble: 2 x 5
# Z W_mean W_median X_mean X_median
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 cat 5.25 5.5 3.75 3.5
#2 dog 5.67 5.5 6.67 7
You can use starts_with() to select the columns, instead of by number.
library(dplyr)
set.seed(1)
Data <- data.frame(
W = sample(1:10),
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE),
Z = sample(c("cat", "dog"), 10, replace = TRUE)
)
summarized <-
Data %>%
group_by(Z) %>%
summarise_if(is.numeric,funs(mean,median),na.rm=T) %>%
select(Z, starts_with("W_"), starts_with("X_"))
summarized
#> # A tibble: 2 x 5
#> Z W_mean W_median X_mean X_median
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 cat 5.25 5.5 3.75 3.5
#> 2 dog 5.67 5.5 6.67 7
Created on 2019-12-09 by the reprex package (v0.3.0)
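In more recent dplyr versions (1.0.0 and later), funs() and summarise_if() have been superseded by across(). A sketch of the same summary with across() is shown below; its output columns are grouped by input column first and by function second, which is the ordering the question asks for, so no reordering step should be needed.

```r
library(dplyr)

set.seed(1)
Data <- data.frame(
  W = sample(1:10),
  X = sample(1:10),
  Y = sample(c("yes", "no"), 10, replace = TRUE),
  Z = sample(c("cat", "dog"), 10, replace = TRUE)
)

summarized <- Data %>%
  group_by(Z) %>%
  summarise(across(
    where(is.numeric),
    # A named list of functions; names become the column suffixes.
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE)
    )
  ))

names(summarized)
```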

Pass function arguments by column position to mutate_at

I'm trying to tighten up a %>% piped workflow where I need to apply the same function to several columns but with one argument changed each time. I feel like purrr's map or invoke functions should help, but I can't wrap my head around it.
My data frame has columns for life expectancy, poverty rate, and median household income. I can pass all these column names to vars in mutate_at, use round as the function to apply to each, and optionally supply a digits argument. But I can't figure out a way, if one exists, to pass different values for digits associated with each column. I'd like life expectancy rounded to 1 digit, poverty rounded to 2, and income rounded to 0.
I can call mutate on each column, but given that I might have more columns all receiving the same function with only an additional argument changed, I'd like something more concise.
library(tidyverse)
df <- tibble::tribble(
~name, ~life_expectancy, ~poverty, ~household_income,
"New Haven", 78.0580437642378, 0.264221051111753, 42588.7592521085
)
In my imagination, I could do something like this:
df %>%
mutate_at(vars(life_expectancy, poverty, household_income),
round, digits = c(1, 2, 0))
But get the error
Error in mutate_impl(.data, dots) :
Column life_expectancy must be length 1 (the number of rows), not 3
Using mutate_at instead of mutate just to have the same syntax as in my ideal case:
df %>%
mutate_at(vars(life_expectancy), round, digits = 1) %>%
mutate_at(vars(poverty), round, digits = 2) %>%
mutate_at(vars(household_income), round, digits = 0)
#> # A tibble: 1 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
Mapping over the digits uses each of the digits options for each column, not by position, giving me 3 rows each rounded to a different number of digits.
df %>%
mutate_at(vars(life_expectancy, poverty, household_income),
function(x) map(x, round, digits = c(1, 2, 0))) %>%
unnest()
#> # A tibble: 3 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.3 42589.
#> 2 New Haven 78.1 0.26 42589.
#> 3 New Haven 78 0 42589
Created on 2018-11-13 by the reprex package (v0.2.1)
2 solutions
mutate with !!!
invoke was a good idea, but you need it less now that most tidyverse functions support the !!! operator. Here's what you can do:
digits <- c(life_expectancy = 1, poverty = 2, household_income = 0)
df %>% mutate(!!!imap(digits, ~round(..3[[.y]], .x),.))
# # A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
..3 is the initial data frame, passed to the function as a third argument, through the dot at the end of the call.
Written more explicitly :
df %>% mutate(!!!imap(
digits,
function(digit, name, data) round(data[[name]], digit),
data = .))
If you need to start from your old interface (though the one I propose will be more flexible), first do:
digits <- setNames(c(1, 2, 0), c("life_expectancy", "poverty", "household_income"))
mutate_at and <<-
Here we bend a bit the good practice of avoiding <<- whenever possible, but readability matters and this one is really easy to read.
digits <- c(1, 2, 0)
i <- 0
df %>%
mutate_at(vars(life_expectancy, poverty, household_income), ~round(., digits[i<<- i+1]))
# A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
(or just df %>% mutate_at(names(digits), ~round(., digits[i<<- i+1])) if you use a named vector as in my first solution)
Here's a map2 solution along the lines of Henrik's comment. You can then wrap this inside a custom function. I provided a rough first attempt, but I have done minimal testing, so it probably breaks under all sorts of situations if evaluation is strange. It also doesn't use tidyselect for .at, but neither does modify_at...
library(tidyverse)
df <- tibble::tribble(
~name, ~life_expectancy, ~poverty, ~household_income,
"New Haven", 78.0580437642378, 0.264221051111753, 42588.7592521085,
"New York", 12.349685329, 0.324067934, 32156.230974623
)
rounded <- df %>%
select(life_expectancy, poverty, household_income) %>%
map2_dfc(
.y = c(1, 2, 0),
.f = ~ round(.x, digits = .y)
)
df %>%
select(-life_expectancy, -poverty, -household_income) %>%
bind_cols(rounded)
#> # A tibble: 2 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
#> 2 New York 12.3 0.32 32156
modify2_at <- function(.x, .y, .at, .f) {
modified <- .x[.at] %>%
map2(.y, .f)
.x[.at] <- modified
return(.x)
}
df %>%
modify2_at(
.y = c(1, 2, 0),
.at = c("life_expectancy", "poverty", "household_income"),
.f = ~ round(.x, digits = .y)
)
#> # A tibble: 2 x 4
#> name life_expectancy poverty household_income
#> <chr> <dbl> <dbl> <dbl>
#> 1 New Haven 78.1 0.26 42589
#> 2 New York 12.3 0.32 32156
Created on 2018-11-13 by the reprex package (v0.2.1)
Fun with tidyeval:
prepared_pairs <-
map2(
set_names(syms(list("life_expectancy", "poverty", "household_income"))),
c(1, 2, 0),
~expr(round(!!.x, digits = !!.y))
)
mutate(df, !!! prepared_pairs)
# # A tibble: 1 x 4
# name life_expectancy poverty household_income
# <chr> <dbl> <dbl> <dbl>
# 1 New Haven 78.1 0.26 42589
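With dplyr 1.0.0 or later, the same per-column argument can also be sketched with across() and cur_column(), looking up the number of digits by column name in a named vector. This assumes the named digits vector from the first solution above.

```r
library(dplyr)

df <- tibble::tribble(
  ~name,       ~life_expectancy,  ~poverty,          ~household_income,
  "New Haven", 78.0580437642378,  0.264221051111753, 42588.7592521085
)

# Digits keyed by column name, so the pairing does not depend on position.
digits <- c(life_expectancy = 1, poverty = 2, household_income = 0)

res <- df %>%
  mutate(across(
    all_of(names(digits)),
    # cur_column() returns the name of the column currently being rounded.
    ~ round(.x, digits[[cur_column()]])
  ))

res
```

Because the lookup is by name, adding a column only requires adding one entry to digits.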
