Naming a new variable based on a quosure - r

I'm trying to write a custom function that will compute a new variable based on values from a predefined vector of variables (e.g., vector_heavy) and then name the new variable based on an argument provided to the function (e.g., custom_name).
This variable naming is where my quosure skills are failing me. Any help is greatly appreciated.
library(tidyverse)
vector_heavy <- quos(disp, wt, cyl)
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(!!cv_name = mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
d <- cv_compute(mtcars, cv_name = custom_name, cv_vector = vector_heavy)
My error message reads:
Error: unexpected '=' in:
" rowwise() %>%
mutate(!!cv_name ="
Removing the !! before cv_name within mutate() gives a function that creates a new variable literally named cv_name, ignoring the custom_name I passed as an argument.
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(cv_name = mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
How can I get this function to utilize the custom_name I supply as an argument for cv_name?

You need to use the := helper within mutate. You'll also need quo_name to convert the input to a string.
The mutate line of your function will then look like
mutate(!!quo_name(cv_name) := mean(c(!!!cv_vector), na.rm = TRUE))
In its entirety:
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(!!quo_name(cv_name) := mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
cv_compute(mtcars, cv_name = custom_name, cv_vector = vector_heavy)
mpg cyl disp hp drat wt qsec vs am gear carb custom_name
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 56.20667
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 56.29167
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 38.10667
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 89.07167
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 123.81333
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 78.15333
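As a side note, newer versions of dplyr and rlang also accept a glue-style string on the left of :=, which avoids the explicit enquo()/quo_name() pair. The following is only a sketch under assumptions: it assumes dplyr 1.0 or later, the function name cv_compute2 and the argument cv_cols are made up for illustration, and it swaps rowwise() plus mean() for rowMeans(), which should give the same row averages here.

```r
library(dplyr)

# Hypothetical variant of cv_compute(); assumes dplyr >= 1.0 and a recent rlang.
cv_compute2 <- function(data, cv_name, cv_cols) {
  # Pick the input columns with tidyselect; {{ }} forwards the caller's expression.
  cols <- select(data, {{ cv_cols }})
  # "{{ cv_name }}" := names the new column after the bare name passed in.
  mutate(data, "{{ cv_name }}" := rowMeans(cols, na.rm = TRUE))
}

d <- cv_compute2(mtcars, custom_name, c(disp, wt, cyl))
head(d$custom_name)
```

Here the columns are passed as a bare c(disp, wt, cyl) rather than a quos() list, which is a design choice of the sketch and not part of the original answer.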


Programming with `{data.table}`: how to name a new column?

The following question seems very basic in programming with data.table, so my apologies if it's a duplicate. I spent time researching but could not find an answer.
I want to create a "user-defined function" that wraps around a data.table wrangling procedure. In this procedure, a new column is created, and I want to let the user set the name of that new column.
Example
Consider the following code that works as-is. I want to wrap it inside a function.
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
Created on 2021-10-13 by the reprex package (v0.3.0)
All I want my function to do is let the user set the name of new_colname_of_choice:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
I've tried using curly braces which didn't work either (actually threw an error):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in:
" as.data.table() %>%
.[, .({new_colname_of_choice} ="
Which is surprising because curly braces do promote the desired naming ability, but in a different (yet similar) kind of code:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
Bottom line
My conclusion is that the = operator is sensitive to {...} on the LHS. How can I otherwise pass a name (from argument) to the LHS in the initial my_wrapper() example?
EDIT
I'd like to add the dplyr solution for the same problem, taken from the programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
Which is pretty robust and works in all naming situations I've encountered. Is there a built-in/canonical practice in data.table similar to {dplyr}'s?
With the upcoming data.table version 1.14.3, you'll be able to use the new env parameter:
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2
One thing you can do is separate the creation of the column and the naming of the column like so:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as_tibble()
}
my_wrapper("my_lovely_colname")
Using this method you can use either .(tempcol = max(mpg)) or tempcol := max(mpg)
Using setNames from stats:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
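For what it's worth, the reason the original my_wrapper() kept the literal name is plain R behaviour, not something specific to data.table: the name on the left of = in a call like list() or .() is taken as written and never evaluated. A minimal base R sketch (the variable nm is only illustrative):

```r
nm <- "my_lovely_colname"

# The name on the left of `=` is used literally; nm is not evaluated.
a <- list(nm = 1)

# setNames() receives nm as an ordinary value, so its contents become the name.
b <- setNames(list(1), nm)

names(a)
names(b)
```

This is why the setNames() and setnames() answers work: both receive the column name as a regular argument rather than as syntax.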

R Edit data frame in function within function

My code consists of many functions, reused across different scripts, each of which modifies a df by adding some columns. I need a global function that calls several of these functions, but because they are called inside another function, my df is not updated on each call. Do you have any advice for this problem?
Here is an example of my problem :
f_a<-function(df){
df$x<-1
.GlobalEnv$df <- df
}
f_b<-function(df){
df$y<-1
.GlobalEnv$df <- df
}
f_global<-function(df){
f_a(df)
f_b(df)
}
In this case df will not have the x and y columns created
Thanks
It's generally a bad idea for functions to have "side effects": things are easier to get right if functions are completely self contained. For your example, that would look like this:
f_a<-function(df){
df$x<-1 # This only changes the local copy
df # This returns the local copy as the function result
}
f_b<-function(df){
df$y<-1
df
}
f_global<-function(df){
df <- f_a(df) # This uses f_a to change the local copy
df <- f_b(df) # This uses f_b to make another change
df # This returns the changed dataframe
}
Then you use it like this:
mydf <- data.frame(z = 1)
mydf <- f_global(mydf)
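If there are many such functions, the same side-effect-free pattern can be written once with Reduce(). This is just a sketch: apply_steps and steps are hypothetical names, and f_a and f_b are repeated so the snippet runs on its own.

```r
# Each step takes a data frame and returns the modified copy.
f_a <- function(df) {
  df$x <- 1
  df
}
f_b <- function(df) {
  df$y <- 1
  df
}

# Hypothetical helper: feed the result of each step into the next one.
apply_steps <- function(df, steps) {
  Reduce(function(acc, f) f(acc), steps, df)
}

steps <- list(f_a, f_b)
mydf <- apply_steps(data.frame(z = 1), steps)
names(mydf)
```

Adding another transformation then only means appending a function to steps.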
Use the <<- operator inside the function. As an example:
dat = data.frame(x1 = rep(1,10),x2 = rep(2,10),x3 = rep(3,10))
head(dat)
myFun <- function(x){
print(x)
dat$x1 <<- rep(5,10)
}
myFun(10)
head(dat)
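One caveat with <<- is that it assigns into the nearest enclosing environment where the name is found, which is not necessarily the global environment. A small sketch, with outer_fun and inner_fun as made-up names:

```r
dat <- data.frame(x1 = 0)          # a global dat

outer_fun <- function() {
  dat <- data.frame(x1 = 1)        # a local dat that shadows the global one
  inner_fun <- function() {
    dat$x1 <<- 5                   # updates outer_fun's dat, not the global one
  }
  inner_fun()
  dat$x1                           # value of the local dat after the update
}

res <- outer_fun()
res
dat$x1
```

So with functions nested inside other functions, <<- may not touch the data frame you expect.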
In the question's f_global, f_b is called with f_global's own argument df, which does not contain the column x that f_a added; f_b then assigns its result to .GlobalEnv, overwriting the df that f_a had just stored there. The fix is to have f_global call f_a first, which creates column x in the global df, and then pass that global df to f_b, which adds column y to it.
All that needs to be changed is f_global:
f_global<-function(df){
f_a(df)
f_b(.GlobalEnv$df)
}
f_global(data.frame(a=1))
df
# a x y
#1 1 1 1
df <- head(mtcars)
f_global(df)
df
# mpg cyl disp hp drat wt qsec vs am gear carb x y
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1 1
Though the code above works and follows the lines of the question, I think that a better strategy is to have f_global change its input argument assigning the return value of each f_* and assign the end result in f_global's parent environment only after all transformations are done.
f_a <- function(df){
df$x <- 1
df
}
f_b <- function(df){
df$y <- 1
df
}
f_global <- function(df){
dfname <- deparse(substitute(df))
df <- f_a(df)
df <- f_b(df)
assign(dfname, df, envir = parent.frame())
invisible(NULL)
}
df1 <- data.frame(a=1)
f_global(df1)
df1
df <- head(mtcars)
f_global(df)
df

Recoding turns everything into the same value in R

I'm practicing R and I created a new column that had continuous numbers in them called ROI, and wanted to recode the number values into string values in R like this:
df = mutate(diabetes_df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
df = mutate(diabetes_df, ROI = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI))
diabetes_df = mutate(diabetes_df, ROI = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI))
This works normally and it displays these words wherever the condition is met. However, when I add the last ifelse statement:
df = mutate(diabetes_df, ROI = ifelse(ROI > 30, 'OVER', ROI))
It turns every value in the new column I made into the OVER value. I was wondering if anyone knew how to make it so that it would only say OVER for where the condition is met?
If ROI is a numeric column, the issue is that you are overwriting a numeric column with text values.
If ROI is not a numeric column, then inequality comparison on text strings works differently from how you have assumed.
Note that all your commands take the form: df = mutate(df, ROI = ifelse(ROI <condition>, 'label', ROI)). This means you are overwriting the original ROI values, and the replaced values will be used for subsequent comparisons.
Suppose df had only one row with ROI = 10, then:
# df:
# ROI = 10
df2 = mutate(df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
# compares 10 < 18.5
# replaces 10 with 'Under'
# df2:
# ROI = 'Under'
df3 = mutate(df2, ROI = ifelse(ROI > 30, 'OVER', ROI))
# compares 'Under' > 30
# After standardizing formats, compares 'Under' > '30' (conversion to string)
# replaces 'Under' with 'OVER'
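The comparison step can be checked directly at the console. A small sketch of R's coercion rule when a string is compared with a number (the values are only illustrative, and string ordering depends on the locale's collation, although digits normally sort before letters):

```r
# The number is converted to a string, so this compares "Under" with "30".
cmp_label <- "Under" > 30

# Strings are compared character by character, so "10" sorts before "9".
cmp_digits <- "10" < "9"

cmp_label
cmp_digits
```

Both comparisons come out TRUE under the usual collation rules, which is why every already-labelled row also satisfies ROI > 30.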
Two possible solutions:
write to a different column, this is good practice
df %>%
mutate(ROI_label = NA) %>%
mutate(ROI_label = ifelse(ROI < 18.5, 'Under', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI > 30, 'OVER', ROI_label))
use case_when, this is also good practice
df %>%
mutate(ROI = case_when(ROI < 18.5 ~ 'Under',
ROI >= 18.5 & ROI <= 25 ~ 'average',
ROI > 25 & BMI <= 30 ~ 'above average',
ROI > 30 ~ 'OVER'))
Even better, write to a different column and use case_when.
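A minimal sketch of that combination, assuming a toy data frame with only a numeric ROI column (the values and the ROI_label name are illustrative):

```r
library(dplyr)

# Toy data standing in for diabetes_df; the values are made up.
df <- data.frame(ROI = c(10, 20, 27, 35))

df <- df %>%
  mutate(ROI_label = case_when(
    ROI < 18.5 ~ "Under",
    ROI <= 25  ~ "average",        # reached only when ROI >= 18.5
    ROI <= 30  ~ "above average",  # reached only when ROI > 25
    ROI > 30   ~ "OVER"            # rows with NA in ROI stay NA
  ))

df
```

Because case_when() uses the first matching condition, the lower bounds do not need to be repeated, and ROI itself stays numeric for later use.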
We can replicate the problem with the mtcars data frame. In the following code, the third mutate() statement sets wt to High in every row because, after the first mutate(), the wt column is a vector of character values.
library(dplyr)
data(mtcars)
mtcars <- mutate(mtcars,wt = ifelse(wt < 2.6,"Low", wt))
# at this point, wt is character
str(mtcars$wt)
> str(mtcars$wt)
chr [1:32] "2.62" "2.875" "Low" "3.215" "3.44" "3.46" "3.57" "3.19" "3.15" ...
By the third mutate() all rows meet the condition of TRUE for the ifelse() based on a character string comparison where the string values of Low and Medium are greater than the number 3.61.
mtcars <- mutate(mtcars, wt = ifelse( 2.6 <= wt & wt <= 3.61,"Medium",wt))
mtcars <- mutate(mtcars, wt = ifelse( wt > 3.61,"High",wt))
...and the output:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 High 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 High 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 High 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 High 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 High 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 High 20.22 1 0 3 1
We can prevent this behavior by using case_when(), which makes all of the comparisons to the numeric version of wt in a single pass of the data.
# use case_when()
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) %>% head(.)
...and the output:
head(.)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 Medium 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 Medium 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 Low 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 Medium 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 Medium 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 Medium 20.22 1 0 3 1
>
From the comments to this answer, it wasn't clear to the OP how to save the changed column to the existing data frame. The following code snippet addresses that question.
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) -> mtcars

R - Unite without NA values [duplicate]

This question already has answers here:
Combine two or more columns in a dataframe into a new column with a new name
(9 answers)
Closed 3 years ago.
I have got multiple columns. All columns do have NA values in some rows. Is it possible to unite these columns without having the NA values in the new column?
Without NA values:
library(tidyr)  # unite() comes from tidyr, not dplyr
unite(mtcars, 'mpg_am', c('mpg','am'))
Creating fake data:
mtcars$NA_1 = ifelse(mtcars$mpg>20, NA, mtcars$mpg)
mtcars$NA_2 = ifelse(mtcars$cyl>6, NA, mtcars$mpg)
unite(mtcars, 'Var1', c('NA_1','NA_2'))
This will create values like
Var1
NA_21
15.5_NA
NA_NA
15.5_21
...
desired output:
Var1
21
15.5
NA
15.5_21
...
We can use unite with na.rm
library(tidyverse)
mtcars %>%
rownames_to_column('rn') %>%
mutate_at(vars(starts_with("NA")), as.character) %>%
unite(Var1, NA_1, NA_2, na.rm = TRUE) %>%
mutate(Var1 = na_if(Var1, "")) %>%
column_to_rownames('rn')
Or another option is coalesce instead of unite
mtcars %>%
mutate(Var1 = str_c(coalesce(NA_1, NA_2), coalesce(NA_2, NA_1), sep="_"))
Or another option is
mtcars %>%
mutate_at(vars(starts_with("NA")), list(~ replace_na(., ''))) %>%
mutate(Var1 = str_remove(na_if(str_c(NA_1, NA_2, sep="_"), '_'), '^_|_$') ) %>%
select(-NA_1, -NA_2)
unite has an na.rm parameter which will remove NA values, but for that the columns need to be of character type.
library(dplyr)
library(tidyr)
mtcars %>%
mutate_at(vars(NA_1, NA_2), as.character) %>%
unite(Var1, NA_1, NA_2, na.rm = TRUE)
# mpg cyl disp hp drat wt qsec vs am gear carb Var1
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 18.1_18.1
#.....
However, if both values are NA, this returns an empty string instead of NA. If we strictly need NA, we can check for empty values and replace them:
mtcars %>%
mutate_at(vars(NA_1, NA_2), as.character) %>%
unite(Var1, NA_1, NA_2, na.rm = TRUE) %>%
mutate(Var1 = replace(Var1, Var1 == "", NA_character_))
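To see the behaviour without the rest of mtcars, here is a small sketch on a toy data frame that mirrors the four cases from the question. The data frame and its column names a and b are made up, and na.rm in unite() assumes tidyr 1.0.0 or later.

```r
library(tidyr)

# Toy data: character columns covering the four NA patterns in the question.
df <- data.frame(
  a = c(NA, "15.5", NA, "15.5"),
  b = c("21", NA, NA, "21"),
  stringsAsFactors = FALSE
)

# na.rm = TRUE drops missing pieces before pasting with the "_" separator.
out <- unite(df, "Var1", a, b, na.rm = TRUE)

# Rows where every piece was NA come back as "", so turn those into NA.
out$Var1[out$Var1 == ""] <- NA

out
```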
Without any packages we can use paste0 in base R
cols <- c('NA_1','NA_2')
mtcars["V1"] <- apply(mtcars[cols],1,function(x) paste0(na.omit(x), collapse = "_"))

How can I get the length of an arbitrary data frame when piped in using Tidyr? [duplicate]

This question already has answers here:
Create an ID (row number) column
(10 answers)
Closed 3 years ago.
I have code like this:
bulk <- read_csv("data/food_bulk_raw.csv") %>%
mutate(Treatment = "bulk", Individual = seq_len(Timestamp))
seq_len() is creating a list of 1:length(Timestamp). It works because 'Timestamp' is a column of the data-frame. But let's say I didn't know anything about my data-frame: Perhaps I am creating a function. How could I indicate the length of the data-frame without first saving it as an object like I have below?
data002 <- read_csv("data/data002.csv")
data002 <- mutate(data002, New_Column = 1:nrow(data002))
You could use any of the following
library(tidyverse)
#Option 1
read_csv("data/food_bulk_raw.csv") %>%
mutate(Treatment = "bulk", Individual = seq_len(nrow(.)))
#Option 2
read_csv("data/food_bulk_raw.csv") %>%
mutate(Treatment = "bulk", Individual = seq(nrow(.)))
#Option 3
read_csv("data/food_bulk_raw.csv") %>%
mutate(Treatment = "bulk", Individual = sequence(nrow(.)))
None of these depend on any column; they all use nrow to create the sequence.
Also, as @Marius commented, you could use n(), which returns the number of rows, instead of nrow. So in all of the above options nrow(.) can be replaced with n().
Apart from that we can also use row_number
read_csv("data/food_bulk_raw.csv") %>%
mutate(Treatment = "bulk", Individual = row_number())
To demonstrate, making a function
df_sequence_func <- function(df) {
df %>% mutate(Individual = seq_len(nrow(.)))
}
df_sequence_func(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb Individual
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#....
df_sequence_func(cars)
# speed dist Individual
#1 4 2 1
#2 4 10 2
#3 7 4 3
#4 7 22 4
#5 8 16 5
#6 9 10 6
#....
It returns a sequential row number irrespective of the columns or rows in the dataframe.
We can use data.table methods
library(data.table)
setDT(df)[, seq_len(.N)]
and it can be read with fread
fread("data/food_bulk_raw.csv")[,
c("Treatment", "Individual") := .("bulk", seq_len(.N))][]
Or in tidyverse
library(tidyverse)
rownames_to_column(data002, 'rn')
Or using
data002 %>%
mutate(New_Column = seq_len(n()))
Or in base R
df$newcolumn <- seq(nrow(df))
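One difference between the seq() and seq_len() variants shows up with an empty data frame, which matters if the code ends up inside a function. A small base R sketch (empty is just an illustrative name):

```r
# A data frame with zero rows.
empty <- data.frame(a = numeric(0))

# seq(0) counts down from 1 to 0, giving two values.
ids_seq <- seq(nrow(empty))

# seq_len(0) gives an empty integer vector.
ids_seq_len <- seq_len(nrow(empty))

ids_seq
ids_seq_len
```

With zero rows, the seq() result has length 2 and would fail on assignment as a column, so seq_len(nrow(df)) is the safer choice in a general-purpose function.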
