Renaming all dataframe columns with stringr and dplyr - r

I am trying to rename all columns in my data frame using dplyr and stringr, but it seems not to be working the way I desire. How should I change the following code to get the output I want (shown in the code below)?
Here is the fully reproducible code:
library(dplyr)
library(stringr)
library(tibble)
library(rlang)
# dataframe
x <-
tibble::as.tibble(cbind(
Grace_neu_wrong = c(1:4),
Grace_acc_wrong = c(1:4),
Grace_att_wrong = c(1:4),
Grace_int_wrong = c(1:4)
))
# defining custom function to rename the entire dataframe in a certain way
string_conversion <- function(df, ...) {
# preparing the dataframe
df <- dplyr::select(.data = df,
!!rlang::quo(...))
# custom function to split the name of each column in a certain way
splitfn <- function(x) {
x <- as.character(x)
split <- stringr::str_split(string = x, pattern = "_")[[1]]
paste(split[2], split[3], '_', split[1], sep = '')
}
# applying the splitfn function to each column name and outputting the data frame
df_new <- df %>%
dplyr::select_all(.funs = colnames) %>%
dplyr::mutate_all(.funs = splitfn)
return(df_new)
}
# the output I get
string_conversion(df = x, names(x))
#> # A tibble: 4 x 4
#> Grace_neu_wrong Grace_acc_wrong Grace_att_wrong Grace_int_wrong
#> <chr> <chr> <chr> <chr>
#> 1 NANA_1 NANA_1 NANA_1 NANA_1
#> 2 NANA_1 NANA_1 NANA_1 NANA_1
#> 3 NANA_1 NANA_1 NANA_1 NANA_1
#> 4 NANA_1 NANA_1 NANA_1 NANA_1
# the output I desire
tibble::as.tibble(cbind(
neuwrong_Grace = c(1:4),
accwrong_Grace = c(1:4),
attwrong_Grace = c(1:4),
intwrong_Grace = c(1:4)
))
#> # A tibble: 4 x 4
#> neuwrong_Grace accwrong_Grace attwrong_Grace intwrong_Grace
#> <int> <int> <int> <int>
#> 1 1 1 1 1
#> 2 2 2 2 2
#> 3 3 3 3 3
#> 4 4 4 4 4
Created on 2018-02-08 by the reprex package (v0.1.1.9000).

You can do this in a single line without mutate, which operates on column values rather than column names. Instead, use stringr::str_replace with a regular expression.
The pattern "(.*)_(.*)_(.*)" matches three groups of characters separated by underscores.
The replacement "\\2\\3_\\1" is group 2, then group 3, then an underscore, then group 1, which gives the desired result.
The code is consequently just one line long:
names(x) <- str_replace(names(x), "(.*)_(.*)_(.*)", "\\2\\3_\\1")
print(x)
# A tibble: 4 x 4
# neuwrong_Grace accwrong_Grace attwrong_Grace intwrong_Grace
# <int> <int> <int> <int>
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 3 3 3
# 4 4 4 4 4
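The same regular expression can also be applied inside a pipeline. The following is a minimal sketch, assuming dplyr 1.0.0 or later (which provides rename_with()); the object names renamed and pieces are only illustrative. The last two lines reproduce why the original attempt printed NANA_1: mutate_all() passed the cell values, not the column names, to splitfn.

```r
library(dplyr)
library(stringr)

x <- tibble(
  Grace_neu_wrong = 1:4,
  Grace_acc_wrong = 1:4,
  Grace_att_wrong = 1:4,
  Grace_int_wrong = 1:4
)

# rename_with() applies a function to the column names, not the values
renamed <- x %>%
  rename_with(~ str_replace(.x, "(.*)_(.*)_(.*)", "\\2\\3_\\1"))

names(renamed)
#> [1] "neuwrong_Grace" "accwrong_Grace" "attwrong_Grace" "intwrong_Grace"

# Why the original attempt failed: splitting a cell value such as "1"
# yields a single piece, so pieces 2 and 3 are NA
pieces <- str_split("1", "_")[[1]]
paste(pieces[2], pieces[3], "_", pieces[1], sep = "")
#> [1] "NANA_1"
```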

Related

How to use a function with mutable number of arguments in R

I have two tibbles with different number of columns. I want to filter df1 using a value from column b and I also want to filter df2 using a value from column b and also column c. Is it possible to do this using the same function?
I followed the list(...) procedure but, of course, I got an error, since in the first case there is no x[[2]].
library(dplyr)
df1 <- tibble(a = c(4,2,3,4),
b = c(8,6,7,8))
df2 <- tibble(a = c(1,2,3,4),
b = c(5,6,7,8),
c = c(1,5,3,7))
df1
#> # A tibble: 4 × 2
#> a b
#> <dbl> <dbl>
#> 1 4 8
#> 2 2 6
#> 3 3 7
#> 4 4 8
df2
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 5 1
#> 2 2 6 5
#> 3 3 7 3
#> 4 4 8 7
createTable <- function(df, ...) {
x <- list(...)
tabl <- df %>%
filter(b < x[[1]], c < x[[2]])
return(tabl)
}
tabl1 <- createTable(df1, 8)
#> Error in `filter()`:
#> ! Problem while computing `..2 = c < x[[2]]`.
#> Caused by error in `x[[2]]`:
#> ! subscript out of bounds
tabl2 <- createTable(df2, 7, 5)
Created on 2022-07-27 by the reprex package (v2.0.1)
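One possible way to handle a varying number of thresholds, offered as a sketch rather than as an answer from the thread: pass each threshold as a named argument and build one filter condition per name. The named-argument interface (b = 8 instead of a bare 8) is an assumption and changes how the function is called.

```r
library(dplyr)
library(rlang)

df1 <- tibble(a = c(4, 2, 3, 4), b = c(8, 6, 7, 8))
df2 <- tibble(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(1, 5, 3, 7))

createTable <- function(df, ...) {
  limits <- list(...)  # e.g. list(b = 8) or list(b = 7, c = 5)
  # build one `column < limit` expression per named argument
  conds <- lapply(names(limits), function(col) {
    expr(.data[[!!col]] < !!limits[[col]])
  })
  # splice the conditions into filter(); an empty list leaves df unchanged
  filter(df, !!!conds)
}

tabl1 <- createTable(df1, b = 8)         # keeps rows with b < 8
tabl2 <- createTable(df2, b = 7, c = 5)  # keeps rows with b < 7 and c < 5
```

Because the conditions are generated from the argument names, the same function works for any number of columns that exist in the data.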

How to use an expression in dplyr::mutate in R

I want to add a new column based on a given character vector.
For instance, below I want to add column d as defined in expr:
library(magrittr)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
just as below:
data %>%
dplyr::mutate(d = a + b)
# # A tibble: 2 x 3
# a b d
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
However, in the code below, while the calculations themselves (i.e., adding) work, the names of the new columns are different from what I expected.
data %>%
dplyr::mutate(!!rlang::parse_expr(expr))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(!!rlang::parse_quo(expr, env = rlang::global_env()))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(rlang::eval_tidy(rlang::parse_expr(expr)))
# # A tibble: 2 x 3
# a b `rlang::eval_tidy(rlang::parse_expr(expr))`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
How can I properly use an expression in dplyr::mutate?
My question is similar to this, but in my example, the new variable (d) and its definition (a + b) are given in a single character vector (expr).
Let's first look at what kind of expression dplyr::mutate needs to create a named variable: a named list whose element is the expression to evaluate and whose element name becomes the column name.
library(tidyverse)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
# let's rewrite the string above as named list containing an expression.
expr2 <- list(d = expr(a + b))
# this works as expected:
data %>%
mutate(!!! expr2)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Now we simply need a function that transforms a string into a named list containing the expression of the right-hand side of the equation. The name needs to be the left-hand side of the equation. We can do this with regular string manipulations. Finally we need to transform the right-hand side of the equation from a string into an expression. We can use base R's str2lang here.
create_expr_ls <- function(str_expr) {
expr_nm <- str_extract(str_expr, "^\\w+")
expr_code <- str_replace_all(str_expr, "(^\\w+\\s?=\\s?)(.*)", "\\2")
set_names(list(str2lang(expr_code)), expr_nm)
}
expr3 <- create_expr_ls(expr)
data %>%
mutate(!!! expr3)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Created on 2022-01-23 by the reprex package (v0.3.0)
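As an alternative to the regular-expression split, the string can be parsed first and then taken apart as a call. This is a sketch, assuming the string always has the form name = expression; the helper name split_assignment is hypothetical.

```r
library(dplyr)
library(rlang)

data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"

split_assignment <- function(str_expr) {
  parsed <- parse_expr(str_expr)  # a call to `=` with two arguments
  stopifnot(is.call(parsed), identical(parsed[[1]], as.name("=")))
  # element 2 is the left-hand side (the name), element 3 the right-hand side
  set_names(list(parsed[[3]]), as_string(parsed[[2]]))
}

result <- data %>% mutate(!!!split_assignment(expr))
```

Letting the parser find the two sides avoids having to get the whitespace handling around = right in a hand-written pattern.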
Any of these work. The second is similar to the first but does not require that rlang be on the search path. The third and fourth also work if the d= part is not present in expr in which case default names are used. The last one uses only base R and is also the shortest.
data %>% mutate(within(., !!parse_expr(expr)))
data %>% mutate(within(., !!parse(text = expr)))
data %>% mutate(data, !!parse_expr(sprintf("tibble(%s)", expr)))
data %>% { eval_tidy(parse_expr(sprintf("mutate(., %s)", expr))) }
within(data, eval(parse(text = expr))) # base R
Note
Assume this preamble:
library(dplyr)
library(rlang)
# input
data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"
To get the desired name for the mutated column, you can still use the same syntax and assign the results to a column with the preferred name. To get this name you can use a regular expression to find what is before = and then remove any leading or trailing spaces that might exist.
expr <- "x = a * b"
# take everything before "=" and trim surrounding whitespace
col_name <- trimws(stringr::str_extract(expr, "[^=]+"))
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_expr(expr))
# A tibble: 2 × 3
# a b x
# <dbl> <dbl> <dbl>
# 1 1 3 3
# 2 2 4 8
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_quo(expr, env = rlang::global_env()))
# A tibble: 2 × 3
# a b x
# <dbl> <dbl> <dbl>
# 1 1 3 3
# 2 2 4 8
data %>%
dplyr::mutate(!!col_name := rlang::eval_tidy(rlang::parse_expr(expr)))
# A tibble: 2 × 3
# a b x
# <dbl> <dbl> <dbl>
# 1 1 3 3
# 2 2 4 8

Renaming columns in a dataframe based on a vector

I was able to do this, but was wondering if there was a more elegant way, possibly with dplyr rename?
# Create dataframe with three named columns
tb <- tibble(col1 = 1:3, col2 = 1:3, col3 = 1:3)
#> # A tibble: 3 x 3
#> col1 col2 col3
#> <int> <int> <int>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
# Named vector with replacement names
new_names <- c(col1 = "Column 1", col3 = "Col3")
#> col1 col3
#> "Column 1" "Col3"
# Rename columns within dataframe
tb <- new_names[colnames(tb)] %>%
coalesce(colnames(tb)) %>%
setNames(object = tb, nm = .)
#> # A tibble: 3 x 3
#> `Column 1` col2 Col3
#> <int> <int> <int>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
# loading dplyr
pacman::p_load(dplyr)
# rename() syntax demands:
# LHS - a new column name
# RHS - an existing column name
# can be either a named vector or a named list
c('Column 1' = 'col1', 'Col3' = 'col3') -> x
# the unquote-splice (!!!) operator unquotes and splices its argument
rename(tibble(col1 = 1:3, col2 = 1:3, col3 = 1:3), !!!x)
#> # A tibble: 3 x 3
#> `Column 1` col2 Col3
#> <int> <int> <int>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
You can find more about it here:
a good book
And here: pretty documentation
The pipe operator adds a small overhead per call, so it is worth skipping where it adds nothing.
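Newer dplyr versions also accept a named character vector directly inside a tidyselect helper. This is a sketch, assuming dplyr 1.0.0 or later; the names lookup and tb_renamed are illustrative. The asker's vector maps old to new, so it is flipped first, because rename() expects new = old.

```r
library(dplyr)

tb <- tibble(col1 = 1:3, col2 = 1:3, col3 = 1:3)
new_names <- c(col1 = "Column 1", col3 = "Col3")  # old = new

# flip to new = old, the direction rename() expects
lookup <- setNames(names(new_names), new_names)

# any_of() silently skips entries whose old name is absent from tb
tb_renamed <- rename(tb, any_of(lookup))
names(tb_renamed)
#> [1] "Column 1" "col2"     "Col3"
```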

How to use vector of column names as input into dplyr::group_by()?

I want to create a function based on dplyr that performs certain operations on subsets of data. The subsets are defined by values of one or more key columns in the dataset. When only one column is used to identify subsets, my code works fine:
set.seed(1)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5)
)
group_key <- "g1"
aggregate <- function(df, by) {
df %>% group_by(!!sym(by)) %>% summarize(a = mean(a))
}
aggregate(df, by = group_key)
This works as expected and returns something like this:
# A tibble: 2 x 2
g1 a
<dbl> <dbl>
1 1 1.5
2 2 4
Unfortunately everything breaks down if I change group_key:
group_key <- c("g1", "g2")
aggregate(df, by = group_key)
I get an error: Only strings can be converted to symbols, which I think comes from rlang::sym(). Replacing it with syms() does not work since I get a list of names, on which group_by() chokes.
Any suggestions would be appreciated!
You need to use the unquote-splice operator !!!:
aggregate <- function(df, by) {
df %>% group_by(!!!syms(by)) %>% summarize(a = mean(a))
}
group_key <- c("g1", "g2")
aggregate(df, by = group_key)
#> # A tibble: 4 x 3
#> # Groups: g1 [2]
#> g1 g2 a
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 4
#> 3 2 1 2.5
#> 4 2 2 5
Alternatively, you can use dplyr::group_by_at:
agg <- function(df, by) {
require(dplyr)
df %>% group_by_at(vars(one_of(by))) %>% summarize(a = mean(a))}
group_key <- "g1"
group_keys <- c("g1","g2")
agg(df, by = group_key)
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
agg(df, by = group_keys)
#> # A tibble: 4 x 3
#> # Groups: g1 [2]
#> g1 g2 a
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 4
#> 3 2 1 2.5
#> 4 2 2 5
Update with dplyr 1.0.0
The new across() accepts tidyselect helpers such as all_of(), which replaces the quote-unquote procedure of NSE. The code looks a bit simpler with that:
aggregate <- function(df, by) {
df %>%
group_by(across(all_of(by))) %>%
summarize(a = mean(a))
}
df %>% aggregate(group_key)
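To check that the across(all_of()) version handles both one and several keys, here is a self-contained sketch. The fixed values in a stand in for sample(5), and the function is named aggregate_by (a hypothetical name) so that it does not mask stats::aggregate().

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a  = c(1, 4, 3, 5, 2)
)

aggregate_by <- function(df, by) {
  df %>%
    group_by(across(all_of(by))) %>%
    summarize(a = mean(a), .groups = "drop")  # return an ungrouped tibble
}

one_key  <- aggregate_by(df, "g1")            # one row per value of g1
two_keys <- aggregate_by(df, c("g1", "g2"))   # one row per g1/g2 combination
```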

Mass changing columns of a data set to numeric

I've imported an excel data set and want to set nearly all columns (greater than 90) to numeric when they are initially characters. What is the best way to achieve this because importing and changing each to numeric one by one isn't the most efficient approach?
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100,sample(0:1,1000,rep=TRUE)))
# Check column names / return column number (just in case you want to check)
colnames(df)
# Specify columns
cols <- seq_along(df) # covers every column, including any added at a later date
# Or, to specify particular column numbers only:
# cols <- c(1:100)
# With the magrittr compound assignment pipe (%<>%), change all to numeric
library(magrittr)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check our columns are numeric
str(df)
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported excel file has 5 columns a to e
df <- tibble(a = as.character(1:3),
b = as.character(5:7),
c = as.character(8:10),
d = as.character(2:4),
e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
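In dplyr 1.0.0 and later, mutate_at() is superseded by across(). A sketch of the same conversions under that assumption; the result names all_but_b and all_chr are illustrative.

```r
library(dplyr)

# Assume the imported excel file has 5 character columns a to e
df <- tibble(a = as.character(1:3),
             b = as.character(5:7),
             c = as.character(8:10),
             d = as.character(2:4),
             e = as.character(2:4))

# convert every column except 'b'
all_but_b <- df %>% mutate(across(-b, as.numeric))

# convert every character column, whatever its name or position
all_chr <- df %>% mutate(across(where(is.character), as.numeric))
```

With more than 90 columns, the where(is.character) form avoids listing positions or names at all.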
