Renaming multiple columns from vector in dplyr chain - r

I have a data frame in which I would like to rename several columns that share a naming convention (e.g., names starting with "X") and/or sit at known positions (e.g., 4:7). The new column names are stored in a vector. How do I rename these columns in a dplyr chain?
# data
library(dplyr)
df <- tibble(RID = 1, Var1 = "A", Var2 = "B", old_name1 = 4, old_name2 = 8, old_name3 = 20)
new_names <- c("new_name1", "new_name2", "new_name3")
# pseudocode
df %>%
  rename_if(starts_with('old_name'), new_names)

An option with rename_at would be
df %>%
rename_at(vars(starts_with('old_name')), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
It is also possible to use rename_if by passing a logical index built from the column names:
df %>%
rename_if(grepl("^old_name", names(.)), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
In general, rename_if applies its predicate to the column values rather than the column names, e.g.
new_names2 <- c('var1', 'var2')
df %>%
rename_if(is.character, ~ new_names2)
# A tibble: 1 x 6
# RID var1 var2 old_name1 old_name2 old_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0

Update: dplyr 1.0.0
dplyr 1.0.0 adds rename_with(), a companion to rename() that takes a function as input. That function can be function(x) new_names; in other words, the purrr-style lambda ~ new_names can serve as the renaming function. The vector must supply one name per selected column, in the order the columns appear.
In my opinion, this gives the most elegant dplyr expression.
# shortest & most elegant expression
df %>% rename_with(~ new_names, starts_with('old_name'))
# A tibble: 1 x 6
RID Var1 Var2 new_name1 new_name2 new_name3
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 A B 4 8 20
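The question also asks about renaming by column position. The following is a minimal sketch, assuming dplyr 1.0.0 or later: rename_with() accepts positions in its tidy-select argument, and rename() accepts a named character vector wrapped in all_of(). The names old_names and lookup are hypothetical helpers introduced only for illustration.

```r
library(dplyr)

df <- tibble(RID = 1, Var1 = "A", Var2 = "B",
             old_name1 = 4, old_name2 = 8, old_name3 = 20)
new_names <- c("new_name1", "new_name2", "new_name3")

# By position: columns 4 to 6 receive the new names in order.
by_position <- df %>% rename_with(~ new_names, 4:6)

# By explicit mapping: the names of `lookup` are the new names,
# its values are the old names (both helpers are hypothetical).
old_names <- names(df)[startsWith(names(df), "old_name")]
lookup <- setNames(old_names, new_names)
by_lookup <- df %>% rename(all_of(lookup))

names(by_position)
names(by_lookup)
```

The explicit mapping is a little longer, but it pairs each old name with its new name instead of relying on column order.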

Related

change pattern of cols with same name in purrr::map_dfc R

I have a function that generates a data frame with two columns (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2", and so on, which appear because the column names are duplicated.
I would like names such as X_df1, Y_df1, X_df2, Y_df2, ... Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the data frame to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!
If we don't want to change the function, we can rename before binding the columns: use pmap to loop over the rows of the data and apply the function (example_function), loop over the resulting list of tibbles with imap, append the list index to every column name, and then combine the result with bind_cols.
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
# one set of names per row of `values`, i.e. per data frame returned by example_function
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>%
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)
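Both answers can be combined into a variant that does not hard-code the column names. This is only a sketch, assuming the function returns one-row tibbles as in the question; the intermediate names results, suffixed and wide are hypothetical. Because the names are already unique before binding, no "New names" message should appear.

```r
library(dplyr)
library(purrr)

example_function <- function(n1, n2) {
  tibble(X = n1 + n2,
         Y = n1 * n2)
}
values <- tibble(n1 = c(1, 2),
                 n2 = c(5, 6))

# One tibble per pair of inputs, kept as a list for now.
results <- map2(values$n1, values$n2, example_function)

# imap() passes each tibble and its list index; the suffix is built from the index.
suffixed <- imap(results, function(d, i) set_names(d, paste0(names(d), "_df", i)))

# Names are unique at this point, so binding needs no name repair.
wide <- bind_cols(suffixed)
wide
```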

Pass column names as function arguments in formula

I want to create a re-usable function for a repeating t-test such that the column names can be passed into a formula. However, I cannot find a way to make it work. So the following code is the idea:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
column = sym(column)
category = sym(category)
stat.test <- table %>%
group_by(subset) %>%
t_test(column ~ category)
return(stat.test)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
The current error that I get is the following:
Error: Can't extract columns that don't exist.
x Column `category` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
Does anybody know how to solve this?
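t_test() reads the formula literally, so column ~ category looks for columns named column and category. Build the formula from the strings instead of wrapping them in sym: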
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
formula <- paste0(column, '~', category) %>%
as.formula()
table %>%
group_by(subset) %>%
t_test(formula)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 x 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 0.484 94.3 0.63
2 Set2 value A B 50 50 -2.15 97.1 0.034
Since we are passing strings, we can simply use reformulate to build the formula:
do.function <- function(table, column, category) {
stat.test <- table %>%
group_by(subset) %>%
t_test(reformulate(category, response = column ))
return(stat.test)
}
-testing
> do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 × 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 1.66 97.5 0.0993
2 Set2 value A B 50 50 0.448 92.0 0.655
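To see what the two formula-building approaches above produce, here is a small sketch that needs only base R. It is not the answerers' code: base t.test() stands in for rstatix::t_test(), the seed is arbitrary, and the data frame is a simplified version of tmp.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
tmp <- data.frame(
  value = rnorm(200),
  subset = rep(c("Set1", "Set2"), each = 50, times = 2),
  categorical_value = rep(c("A", "B"), each = 25, times = 4)
)

column <- "value"
category <- "categorical_value"

# Both calls build the formula value ~ categorical_value from strings.
f1 <- reformulate(category, response = column)
f2 <- as.formula(paste(column, "~", category))

# Base t.test() used as a stand-in: one Welch test per level of `subset`.
p_values <- sapply(split(tmp, tmp$subset),
                   function(d) t.test(f1, data = d)$p.value)

print(f1)
print(p_values)
```

reformulate() avoids pasting strings by hand, which is convenient when there are several terms on the right-hand side.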
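rstatix::t_test already takes a formula, so we only need to fetch the variables by name. Note that this version drops group_by(subset), so it runs a single test on the whole table and reports .y. as column.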
do.function <- function(table, column, category) {
stat.test <- table %>%
mutate(column=get(column),
category=get(category)) %>%
rstatix::t_test(column ~ category)
return(stat.test)
}
do.function(table=tmp, column="value", category="categorical_value")
# # A tibble: 1 × 8
# .y. group1 group2 n1 n2 statistic df p
# * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 column A B 100 100 0.996 197. 0.32

Iterating over listed data frames within a piped purrr anonymous function call

Using purrr::map and the magrittr pipe, I am trying to generate a new column whose values are a substring of an existing column's name.
I can illustrate what I'm trying to do with the following toy dataset:
library(tidyverse)
library(purrr)
test <- list(tibble(geoid_1970 = c(123, 456),
name_1970 = c("here", "there"),
pop_1970 = c(1, 2)),
tibble(geoid_1980 = c(234, 567),
name_1980 = c("here", "there"),
pop_1970 = c(3, 4))
)
Within each listed data frame, I want a column equal to the relevant year. Without iterating, the code I have is:
data <- map(test, ~ .x %>% mutate(year = as.integer(str_sub(names(test[[1]][1]), -4))))
Of course, this returns a year of 1970 in both listed data frames, which I don't want. (I want 1970 in the first and 1980 in the second.)
In addition, it's not piped, and my attempt to pipe it throws an error:
data <- test %>% map(~ .x %>% mutate(year = as.integer(str_sub(names(.x[[1]][1]), -4))))
# > Error: Problem with `mutate()` input `year`.
# > x Input `year` can't be recycled to size 2.
# > ℹ Input `year` is `as.integer(str_sub(names(.x[[1]][1]), -4))`.
# > ℹ Input `year` must be size 2 or 1, not 0.
How can I iterate over each listed data frame using the pipe?
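names(.x[[1]][1]) extracts the first column as a bare vector, which has no names, so the input to mutate has size 0. Use .x[1] instead, which keeps a one-column tibble: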
test %>% map(~.x %>% mutate(year = as.integer(str_sub(names(.x[1]), -4))))
[[1]]
# A tibble: 2 x 4
geoid_1970 name_1970 pop_1970 year
<dbl> <chr> <dbl> <int>
1 123 here 1 1970
2 456 there 2 1970
[[2]]
# A tibble: 2 x 4
geoid_1980 name_1980 pop_1970 year
<dbl> <chr> <dbl> <int>
1 234 here 3 1980
2 567 there 4 1980
We can get the 'year' with parse_number
library(dplyr)
library(purrr)
map(test, ~ .x %>%
mutate(year = readr::parse_number(names(.)[1])))
-output
#[[1]]
# A tibble: 2 x 4
# geoid_1970 name_1970 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 123 here 1 1970
#2 456 there 2 1970
#[[2]]
# A tibble: 2 x 4
# geoid_1980 name_1980 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 234 here 3 1980
#2 567 there 4 1980
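The difference between the failing and the working call comes down to `[[` versus `[`. A minimal sketch on the toy data from the question (the object first is a hypothetical helper):

```r
library(dplyr)
library(purrr)
library(stringr)

test <- list(
  tibble(geoid_1970 = c(123, 456), name_1970 = c("here", "there"), pop_1970 = c(1, 2)),
  tibble(geoid_1980 = c(234, 567), name_1980 = c("here", "there"), pop_1970 = c(3, 4))
)

first <- test[[1]]
names(first[[1]][1])  # NULL: `[[` returns the bare column vector
names(first[1])       # "geoid_1970": `[` keeps a one-column tibble

# Inside map(), .x is the current data frame, so each one reads its own column name.
data <- test %>%
  map(~ mutate(.x, year = as.integer(str_sub(names(.x)[1], -4))))

map_int(data, ~ .x$year[1])
```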

Keeping excluded variables in summarise_at

I have a dataframe (this is just a subset of the full frame):
Depth <- seq(0, 2, 0.2)
cps <- sample(48000:52000, 11)
Al <- rnorm(11)
Si <- rnorm(11)
Fe <- rnorm(11)
df <- as_tibble(cbind(Depth, cps, Al, Si, Fe))
When I use mutate_at to perform a function for only chosen variables the final df still contains the variables I chose to exclude. So,
df_norm <- df %>%
mutate_at(vars(-c(Depth, cps)), ~abs(log(./df$cps)))
performs the function on Al, Si, Fe and df_norm is still a 11x5 tibble with Depth and cps being unchanged from df. However, when I do a similar move with summarise_at:
df_mean <- df %>%
summarise_at(vars(-c(Depth, cps)), mean)
the resulting dataframe is only 1x3 instead of 1x5 i.e. it removed Depth and cps instead of just ignoring them for the averaging. Is there a different way I should be writing the vars argument to keep these?
EDIT
I would like my output to be a single observation(vector) with all 5 variables [1,5] at the median Depth value (in this case 1).
In dplyr 1.0.0 and later (the development version when this was written), we can use summarise with across. It is still unclear which values are wanted for 'Depth' and 'cps', so here they are collapsed into list columns:
library(dplyr)
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, list))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <list> <list>
#1 -0.438 -0.118 -0.590 <dbl [11]> <dbl [11]>
Or to get the first row
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, first))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.438 -0.118 -0.590 0 51432
Or to subset the median element of 'Depth'
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, ~ .[Depth == median(Depth)]))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.438 -0.118 -0.590 1 51753
If we need the first row, then mutate and slice the first row
df %>%
mutate_at(vars(-c(Depth, cps)), mean) %>%
slice(1)
# A tibble: 1 x 5
# Depth cps Al Si Fe
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 51432 -0.438 -0.118 -0.590
Or if it needs to be the median row
df %>%
mutate_at(vars(-c(Depth, cps)), mean) %>%
filter(Depth == median(Depth))
# A tibble: 1 x 5
# Depth cps Al Si Fe
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 51753 -0.438 -0.118 -0.590
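The last variant can also be written with across() in place of mutate_at(). This is a sketch, assuming dplyr 1.0.0 or later; the seed is arbitrary and the data are built with tibble() directly rather than as_tibble(cbind(...)).

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only for reproducibility
df <- tibble(Depth = seq(0, 2, 0.2),
             cps = sample(48000:52000, 11),
             Al = rnorm(11), Si = rnorm(11), Fe = rnorm(11))

df_mean <- df %>%
  mutate(across(Al:Fe, mean)) %>%   # overwrite each element column with its mean
  filter(Depth == median(Depth))    # keep only the row at the median depth

df_mean
```

With an odd number of rows the median is one of the observed Depth values, so the equality test selects exactly one row; with an even number it would generally match none.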

weighted mean in dplyr for multiple columns

I'm trying to calculate the weighted mean for multiple columns using dplyr. At the moment I'm stuck with summarize_each, which seems to me to be part of the solution. Here's some example code:
library(dplyr)
f2a <- c(1,0,0,1)
f2b <- c(0,0,0,1)
f2c <- c(1,1,1,1)
clustervar <- c("A","B","B","A")
weight <- c(10,20,30,40)
df <- data.frame (f2a, f2b, f2c, clustervar, weight, stringsAsFactors=FALSE)
df
What I am looking for is something like
df %>%
group_by (clustervar) %>%
summarise_each(funs(weighted.mean(weight)), select=cbind(clustervar, f2a:f2c))
The result of this is only:
# A tibble: 2 × 4
clustervar select4 select5 select6
<chr> <dbl> <dbl> <dbl>
1 A 25 25 25
2 B 25 25 25
What am I missing here?
You can use summarise_at to specify which columns you want to operate on:
df %>% group_by(clustervar) %>%
summarise_at(vars(starts_with('f2')),
funs(weighted.mean(., weight)))
#> # A tibble: 2 × 4
#> clustervar f2a f2b f2c
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 0.8 1
#> 2 B 0 0.0 1
We can reshape it to 'long' format and then do this
library(tidyverse)
gather(df, Var, Val, f2a:f2c) %>%
group_by(clustervar, Var) %>%
summarise(wt =weighted.mean(Val, weight)) %>%
spread(Var, wt)
Or another option is
df %>%
group_by(clustervar) %>%
summarise_each(funs(weighted.mean(., weight)), matches("^f"))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 0.8 1
# 2 B 0 0.0 1
Or with summarise_at and matches (another variation of another post - didn't see the other post while posting)
df %>%
group_by(clustervar) %>%
summarise_at(vars(matches('f2')), funs(weighted.mean(., weight)))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
Or another option is data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, weight)),
by = clustervar, .SDcols = f2a:f2c]
# clustervar f2a f2b f2c
#1: A 1 0.8 1
#2: B 0 0.0 1
NOTE: All four answers are based on legitimate tidyverse/data.table syntax and would get the expected output
We can also create a function that uses the syntax from the devel version of dplyr (soon to be released 0.6.0). enquo does a job similar to substitute: it captures an input argument and converts it to a quosure. Within group_by/summarise/mutate, we evaluate the quosure by unquoting it (UQ or !!).
wtFun <- function(dat, pat, wtcol, grpcol){
wtcol <- enquo(wtcol)
grpcol <- enquo(grpcol)
dat %>%
group_by(!!grpcol) %>%
summarise_at(vars(matches(pat)), funs(weighted.mean(., !!wtcol)))
}
wtFun(df, "f2", weight, clustervar)
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
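funs() and summarise_each() are deprecated in current dplyr. The answers above were written before across() existed; a sketch of the same grouped weighted mean, assuming dplyr 1.0.0 or later, could look like this:

```r
library(dplyr)

f2a <- c(1, 0, 0, 1)
f2b <- c(0, 0, 0, 1)
f2c <- c(1, 1, 1, 1)
clustervar <- c("A", "B", "B", "A")
weight <- c(10, 20, 30, 40)
df <- data.frame(f2a, f2b, f2c, clustervar, weight, stringsAsFactors = FALSE)

out <- df %>%
  group_by(clustervar) %>%
  # .x is the current f2* column; weight is looked up within the current group
  summarise(across(starts_with("f2"), ~ weighted.mean(.x, weight)))

out
```

For cluster A, f2b is (0 * 10 + 1 * 40) / 50 = 0.8, which agrees with the outputs shown in the answers.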

Resources