Create column from symbol in mutate (tidy eval) - r

So I want to create a new column called var that has the text "testing" for all rows. I.e. the result should be like mtcars$var <- "testing". I have tried different things such as as_name, as_string...
library(tidyverse)
f <- function(df, hello = testing){
df %>%
mutate(var = hello)
}
f(mtcars)

We can do:
f <- function(df, hello = testing){
hello <- deparse(substitute(hello))
df %>%
mutate(var =rlang::as_name(hello))
}
f(mtcars)
However, as pointed out by @Lionel Henry (see comments below):
deparse() does not check that its input is simple and might return a character vector of length > 1. In that case as_name() fails; otherwise it does nothing, since the input is already a string.
as_name(substitute(hello)) does the same but checks that the input is a simple symbol or string. It is more constrained than as_label().
It might therefore be better rewritten as:
f <- function(df, hello = testing){
hello <- as_label(substitute(hello))
df %>%
mutate(var = hello )
}
Or:
f <- function(df, hello = testing){
hello <- rlang::as_name(substitute(hello))
df %>%
mutate(var = hello)
}
Result:
mpg cyl disp hp drat wt qsec vs am gear carb var
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 testing
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 testing
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 testing
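To make the difference between as_label() and as_name() concrete, here is a minimal sketch. The helpers label_of() and name_of() are hypothetical names introduced only for this illustration; they assume the rlang package is installed.

```r
library(rlang)

# Hypothetical helpers, named here only for illustration
label_of <- function(x) as_label(substitute(x))
name_of  <- function(x) as_name(substitute(x))

label_of(testing)   # a bare symbol becomes the string "testing"
name_of(testing)    # same result for a simple symbol

label_of(a + b)     # as_label() accepts any expression and deparses it

# as_name() rejects input that is not a simple symbol or string
failed <- tryCatch(
  {
    name_of(a + b)
    FALSE
  },
  error = function(e) TRUE
)
failed
```

So as_name() is probably the safer choice when the argument must be a single bare name, while as_label() is more forgiving if callers might pass an expression.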

Related

Prevent change to dataframe format in R

I have a dataframe that must have a specific layout. Is there a way for me to make R reject any command I attempt that would change the number or names of the columns?
It is easy to check the format of the data table manually, but I have found no way to make R do it for me automatically every time I execute a piece of code.
regards
This doesn’t offer the level of foolproof safety I think you’re looking for (hard to know without more details), but you could define a function operator that yields modified functions that error if changes to columns are detected:
same_cols <- function(fn) {
function(.data, ...) {
out <- fn(.data, ...)
stopifnot(identical(sort(names(.data)), sort(names(out))))
out
}
}
For example, you could create modified versions of dplyr functions:
library(dplyr)
my_mutate <- same_cols(mutate)
my_summarize <- same_cols(summarize)
which work as usual if columns are preserved:
mtcars %>%
my_mutate(mpg = mpg / 2) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 10.50 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 10.50 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 11.40 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 10.70 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 9.35 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 9.05 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
my_summarize(across(everything(), mean))
# mpg cyl disp hp drat wt qsec vs am
# 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
# gear carb
# 1 3.6875 2.8125
But throw errors if changes to columns are made:
mtcars %>%
my_mutate(mpg2 = mpg / 2)
# Error in my_mutate(., mpg2 = mpg/2) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
mtcars %>%
my_summarize(mpg = mean(mpg))
# Error in my_summarize(., mpg = mean(mpg)) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
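If the stopifnot() message is too terse, the same idea can be sketched with an error that names the offending columns. This is only an illustrative variant: same_cols_verbose() is a hypothetical name, and base::transform() stands in for mutate() so that the sketch runs without any packages.

```r
# Hypothetical variant of same_cols() with a more descriptive error
same_cols_verbose <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    added   <- setdiff(names(out), names(.data))
    removed <- setdiff(names(.data), names(out))
    if (length(added) > 0 || length(removed) > 0) {
      stop(
        "columns changed; added: [", paste(added, collapse = ", "),
        "], removed: [", paste(removed, collapse = ", "), "]",
        call. = FALSE
      )
    }
    out
  }
}

# base::transform() is used here only to keep the example dependency-free
my_transform <- same_cols_verbose(transform)

# Columns preserved: works as usual
ok <- my_transform(mtcars, mpg = mpg / 2)

# Column added: the error message lists it
msg <- tryCatch(
  my_transform(mtcars, mpg2 = mpg / 2),
  error = conditionMessage
)
msg
```

Note that this variant compares sets of names with setdiff(), so, like the original, it ignores column order.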
You mention that both the number and the names of the columns need to stay the same. Be aware that data.table updates names by reference, so a vector assigned from names() changes along with the table. See the example below.
library(data.table)
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- names(foo)
colnames
# [1] "x" "y"
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
colnames
# [1] "a" "b" "z"
identical(colnames, names(foo))
# [1] TRUE
To check that both the columns and their names are unaltered (and, here, in the same order), take a copy of the names right away. After each piece of code runs, you can compare the current names with the copied names.
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- copy(names(foo))
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
identical(colnames, names(foo))
# [1] FALSE
colnames
# [1] "x" "y"
names(foo)
# [1] "a" "b" "z"

Dplyr: Conditionally rename multiple variables with regex by name

I need to rename multiple variables using a replacement dataframe. This replacement dataframe also includes regex. I would like to use a solution similar to the one proposed here, e.g.
df %>% rename_with(~ newnames, all_of(oldnames))
MWE:
df <- mtcars[, 1:5]
# works without regex
replace_df_1 <- tibble::tibble(
old = df %>% colnames(),
new = df %>% colnames() %>% toupper()
)
df %>% rename_with(~ replace_df_1$new, all_of(replace_df_1$old))
# with regex
replace_df_2 <- tibble::tibble(
old = c("^m", "cyl101|cyl", "disp", "hp", "drat"),
new = df %>% colnames() %>% toupper()
)
old new
<chr> <chr>
1 ^m MPG
2 cyl101|cyl CYL
3 disp DISP
4 hp HP
5 drat DRAT
# does not work
df %>% rename_with(~ replace_df_2$new, all_of(replace_df_2$old))
df %>% rename_with(~ matches(replace_df_2$new), all_of(replace_df_2$old))
EDIT 1:
The solution of @Maël works in general, but there seems to be an indexing issue. Consider, for example, the following:
replace_df_2 <- tibble::tibble(
old = c("xxxx", "cyl101|cyl", "yyy", "xxx", "yyy"),
new = mtcars[,1:5] %>% colnames() %>% toupper()
)
mtcars[, 1:5] %>%
rename_with(~ replace_df_2$new, matches(replace_df_2$old))
Results in
mpg MPG disp hp drat
<dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9
meaning that the rename_with function correctly finds the column, but replaces it with the first item in the replacement column. How can we tell the function to take the respective row where a replacement has been found?
So in this example (edit 1), I only want to substitute the second column with "CYL", the rest should be left untouched. The problem is that the function takes the first replacement (MPG) instead of the second (CYL).
Thank you for any hints!
matches should be on the regex-y column:
df %>%
rename_with(~ replace_df_2$new, matches(replace_df_2$old))
MPG CYL DISP HP DRAT
Mazda RX4 21.0 6 160.0 110 3.90
Mazda RX4 Wag 21.0 6 160.0 110 3.90
Datsun 710 22.8 4 108.0 93 3.85
Hornet 4 Drive 21.4 6 258.0 110 3.08
Hornet Sportabout 18.7 8 360.0 175 3.15
Valiant 18.1 6 225.0 105 2.76
#...
If the task is simply to set all col names to upper-case, then this works:
sub("^(.+)$", "\\U\\1", colnames(df), perl = TRUE)
[1] "MPG" "CYL" "DISP" "HP" "DRAT"
In dplyr:
df %>%
rename_with(~ sub("^(.+)$", "\\U\\1", .x, perl = TRUE))
I found a solution using the idea of non-standard evaluation from this question and @Maël's answer.
Using map_lgl we create a logical vector that is TRUE for each pattern in replace_df_2$old that matches at least one column name of df. We then use this logical vector to subset replace_df_2$new and get the correct replacement.
library(tidyverse)
df <- mtcars[, 1:5]
df %>%
  rename_with(.fn = ~ replace_df_2$new[map_lgl(replace_df_2$old, ~ any(str_detect(names(df), .x)))],
              .cols = matches(replace_df_2$old))
Result:
mpg CYL disp hp drat
Mazda RX4 21.0 6 160.0 110 3.90
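Since the difficulty in this thread is pairing each matched column with the replacement from the same row, a more explicit mapping may be easier to reason about. The following is a minimal base R sketch, not the method from any of the answers: rename_by_regex() is a hypothetical helper that, for every column name, looks up the first pattern that matches and returns the corresponding replacement, leaving unmatched names untouched.

```r
# Hypothetical helper: map each name to the replacement of the first
# pattern that matches it; names without a match are returned unchanged
rename_by_regex <- function(nms, patterns, replacements) {
  vapply(nms, function(nm) {
    hit <- which(vapply(patterns, grepl, logical(1), x = nm))
    if (length(hit) > 0) replacements[hit[1]] else nm
  }, character(1), USE.NAMES = FALSE)
}

df <- mtcars[, 1:5]
new <- toupper(names(df))

# Only the second pattern matches a column, so only cyl is renamed
old_partial <- c("xxxx", "cyl101|cyl", "yyy", "xxx", "yyy")
partial <- rename_by_regex(names(df), old_partial, new)
partial

# Every pattern matches exactly one column
old_full <- c("^m", "cyl101|cyl", "disp", "hp", "drat")
full <- rename_by_regex(names(df), old_full, new)
full
```

Inside a dplyr pipeline, the same helper could presumably be passed as rename_with(~ rename_by_regex(.x, old, new)), because rename_with() hands the selected column names to the function as a character vector.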

R Edit data frame in function within function

My code is made up of many functions, reused across different scripts, each of which modifies a df by adding some columns. I need a global function that calls several of these functions, but because they are called inside another function, my df is not updated on every function call. Do you have any advice for this problem?
Here is an example of my problem :
f_a<-function(df){
df$x<-1
.GlobalEnv$df <- df
}
f_b<-function(df){
df$y<-1
.GlobalEnv$df <- df
}
f_global<-function(df){
f_a(df)
f_b(df)
}
In this case df will not have the x and y columns created
Thanks
It's generally a bad idea for functions to have "side effects": things are easier to get right if functions are completely self contained. For your example, that would look like this:
f_a<-function(df){
df$x<-1 # This only changes the local copy
df # This returns the local copy as the function result
}
f_b<-function(df){
df$y<-1
df
}
f_global<-function(df){
df <- f_a(df) # This uses f_a to change the local copy
df <- f_b(df) # This uses f_b to make another change
df # This returns the changed dataframe
}
Then you use it like this:
mydf <- data.frame(z = 1)
mydf <- f_global(mydf)
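When there are many such steps, the "take a data frame, return a data frame" pattern can be chained without writing one assignment per step. Here is a minimal sketch using base R's Reduce(); apply_steps() is a hypothetical helper name, and f_a and f_b are redefined so the block runs on its own.

```r
f_a <- function(df) {
  df$x <- 1
  df
}
f_b <- function(df) {
  df$y <- 1
  df
}

# Hypothetical helper: apply a list of functions in order, feeding the
# result of each step into the next one
apply_steps <- function(df, steps) {
  Reduce(function(acc, step) step(acc), steps, init = df)
}

mydf <- data.frame(z = 1)
mydf <- apply_steps(mydf, list(f_a, f_b))
names(mydf)
```

Because every step is free of side effects, the order of the list is the only thing that determines the result.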
Use the <<- operator in the function. As an example:
dat = data.frame(x1 = rep(1,10),x2 = rep(2,10),x3 = rep(3,10))
head(dat)
myFun <- function(x){
print(x)
dat$x1 <<- rep(5,10)
}
myFun(10)
head(dat)
In the call to f_b, the input argument df is still f_global's original argument (without x), and f_b assigns its result to .GlobalEnv, overwriting the df that f_a stored there. So f_global first calls f_a, which creates column x, then calls f_b with the unmodified input data.frame, and only column y survives in the global df.
All that needs to be changed is f_global:
f_global<-function(df){
f_a(df)
f_b(.GlobalEnv$df)
}
f_global(data.frame(a=1))
df
# a x y
#1 1 1 1
df <- head(mtcars)
f_global(df)
df
# mpg cyl disp hp drat wt qsec vs am gear carb x y
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1 1
Though the code above works and follows the lines of the question, I think that a better strategy is to have f_global change its input argument assigning the return value of each f_* and assign the end result in f_global's parent environment only after all transformations are done.
f_a <- function(df){
df$x <- 1
df
}
f_b <- function(df){
df$y <- 1
df
}
f_global <- function(df){
dfname <- deparse(substitute(df))
df <- f_a(df)
df <- f_b(df)
assign(dfname, df, envir = parent.frame())
invisible(NULL)
}
df1 <- data.frame(a=1)
f_global(df1)
df1
df <- head(mtcars)
f_global(df)
df

Order columns from a list of pre-defined names and ignore column names which don't exist in the list

I want to order a data.table by using a set of predefined names available in a list.
For example:
library(data.table)
dt <- as.data.table(mtcars)
list_name <-c("mpg", "disp", "xyz")
#Order columns
setcolorder(dt, list_name) #requirement: if "xyz" column doesn't exist it should ignore and take the rest
The use case is that multiple data.tables are being created, and all of them take their column names from a list of names. Some column names may be missing in some of the data, but the data needs to be ordered according to the list.
output:
dt
disp wt mpg cyl hp drat qsec vs am gear carb
1: 160.0 2.620 21.0 6 110 3.90 16.46 0 1 4 4
2: 160.0 2.875 21.0 6 110 3.90 17.02 0 1 4 4
3: 108.0 2.320 22.8 4 93 3.85 18.61 1 1 4 1
One option is to put all of the data.tables in a list, loop over it with lapply, and call setcolorder on each element, using intersect to keep only the names that exist in that dataset:
lst1 <- list(dt, dt)
lst1 <- lapply(lst1, function(x) setcolorder(x, intersect(list_name, names(x))))
If we need to reuse, create a function
f1 <- function(dat, nm1) {
setcolorder(dat, intersect(nm1, names(dat)))
}
f1(dt, list_name)
f1(dt2, list_name)
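The key step is intersect(), which keeps only the requested names that actually exist and preserves the order of its first argument. For a plain data.frame (no data.table), the same idea can be sketched in base R; reorder_cols() is a hypothetical helper, and unlike setcolorder() it returns a reordered copy instead of modifying by reference.

```r
# Hypothetical helper for plain data.frames: requested columns that exist
# come first (in the requested order), all remaining columns follow
reorder_cols <- function(df, wanted) {
  front <- intersect(wanted, names(df))
  df[, c(front, setdiff(names(df), front)), drop = FALSE]
}

# "xyz" does not exist in mtcars and is silently ignored
out <- reorder_cols(mtcars, c("disp", "mpg", "xyz"))
names(out)[1:3]
```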

Using the dot operator in dplyr::bind_cols

I'm seeing some unexpected behavior with dplyr. I have a specific use case but I will setup a dummy problem to illustrate my point. Why does this work,
library(dplyr)
temp <- bind_cols(mtcars %>% select(-mpg), mtcars %>% select(mpg))
head(temp)
cyl disp hp drat wt qsec vs am gear carb mpg
6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0
6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0
But not this,
library(dplyr)
temp <- mtcars %>% bind_cols(. %>% select(-mpg), . %>% select(mpg))
Error in cbind_all(x) : Argument 2 must be length 1, not 32
Thanks for the help.
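By default, %>% inserts mtcars as the first argument of bind_cols, and a pipeline that starts with . (such as . %>% select(-mpg)) creates a function instead of being evaluated. Wrapping the call in {} stops the automatic first-argument insertion and lets you use . wherever it is needed: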
library(dplyr)
temp1 = mtcars %>% {bind_cols(select(., -mpg), select(., mpg))}
temp2 = bind_cols(mtcars %>% select(-mpg), mtcars %>% select(mpg))
# > identical(temp1, temp2)
# [1] TRUE
Another solution:
myfun <- function(x) {
bind_cols(x %>% select(-mpg), x %>% select(mpg))
}
temp <- mtcars %>% myfun
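The two magrittr behaviours involved here can be checked in isolation. The sketch below assumes only the magrittr package; first_two is a hypothetical name, and cbind() stands in for bind_cols() to keep the example small.

```r
library(magrittr)

# A pipeline that starts with `.` is not evaluated: it builds a function
first_two <- . %>% head(2)
is.function(first_two)
nrow(first_two(mtcars))

# Braces stop the pipe from inserting the left-hand side as the first
# argument, so `.` can be used freely inside the expression
res <- mtcars %>% {
  cbind(.[, "cyl", drop = FALSE], .[, "mpg", drop = FALSE])
}
names(res)
```

This is why the original call fails: bind_cols receives mtcars as its first argument, followed by two functions rather than two data frames.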
