I'm new to R and am trying to learn how to create my own functions.
While the following function works fine:
#---------------------
# this works fine
#---------------------
func <- function(df) {
new_df <- unite(df, key, c("Sepal.Length","Sepal.Width"), sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
new_iris <- func(iris)
this function, where the unite function's third argument is now parameterized:
#---------------------
# this does not work
#---------------------
func <- function(df, keycols) {
new_df <- unite(df, key, keycols, sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
keycols <- quote(c("Sepal.Length","Sepal.Width"))
new_iris <- func(iris, keycols)
generates the following error message:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type language.
i It must be numeric or character.
Is there a way to pass c("Sepal.Length","Sepal.Width") as a parameter? Or some way to make the keycols a parameter for the above user defined function?
Thanks for any guidance.
One safe way to achieve this is to use the curly-curly operator {{ }} from the rlang package:
library(tidyr)
library(rlang)
iris <- tibble::as_tibble(iris)
# using curly curly from {rlang} ------------------------------------------
func <- function(df, keycols) {
new_df <- unite(df, "key", {{keycols}}, sep = " ", remove = FALSE, na.rm = FALSE)
return(new_df)
}
func(iris, c(Sepal.Length, Sepal.Width)) # passing directly the columns
#> # A tibble: 150 × 6
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
func(iris, c("Sepal.Length", "Sepal.Width")) # passing columns as character vector
#> # A tibble: 150 × 6
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
Created on 2022-07-08 by the reprex package (v2.0.1)
To understand why and how this solution works, see the "Programming with dplyr" vignette.
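If the column names arrive as a character vector stored in a variable, as in the question, another option is tidyselect's all_of(). The sketch below assumes tidyr and tidyselect are installed; the function name unite_cols is made up for illustration and is not part of either package.

```r
library(tidyr)

# Hypothetical helper: `keycols` is a character vector of column names.
# all_of() tells tidyselect to treat it as names, not as a column called "keycols".
unite_cols <- function(df, keycols) {
  unite(df, "key", tidyselect::all_of(keycols),
        sep = " ", remove = FALSE, na.rm = FALSE)
}

keycols <- c("Sepal.Length", "Sepal.Width")
new_iris <- unite_cols(iris, keycols)
head(new_iris$key, 3)
#> [1] "5.1 3.5" "4.9 3"   "4.7 3.2"
```

all_of() raises an error if any of the named columns is missing, which is usually what you want inside a function.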
You may adopt either of these methods
library(tidyr)
func <- function(df, ...){
unite(df, key, ..., sep=" ", remove = FALSE, na.rm = FALSE)
}
func(iris, Sepal.Length, Sepal.Width)
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
keycols <- c("Sepal.Length", "Sepal.Width")
func <- function(df, cols){
unite(df, key, !!cols, sep=" ", remove = FALSE, na.rm = FALSE)
}
func(iris, keycols)
#> key Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 4.6 3.1 1.5 0.2 setosa
Created on 2022-07-08 by the reprex package (v2.0.1)
I am writing a custom R function which includes messages that are printed to the R console during the run and are designed to help the user figure out the steps taken by the program; they are not necessarily error messages, as exemplified below:
# y: a data.frame
# stdNames: a set of standard column names in y
if (length(setdiff(names(y), stdNames)) > 0) {
cat('\ndata includes non-standard columns: selecting the standard columns...\n')
... <further code>
}
However, I would like to have the option to turn off these messages at will. Please advise, thank you!
Update: It was not my intention to suppress all the messages in my function. Therefore, I have wrapped some of them with cat(), leaving the rest as message().
Example of message() and suppressMessages() per @MrFlick:
my_fun <- function(y, stdNames) {
if(length(setdiff(names(y), stdNames)) > 0)
message("data includes non-standard columns: selecting the standard columns...")
head(y[, stdNames])
}
# no message expected
my_fun(iris, names(iris))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# expected message
my_fun(iris, names(iris)[-1])
#> data includes non-standard columns: selecting the standard columns...
#> Sepal.Width Petal.Length Petal.Width Species
#> 1 3.5 1.4 0.2 setosa
#> 2 3.0 1.4 0.2 setosa
#> 3 3.2 1.3 0.2 setosa
#> 4 3.1 1.5 0.2 setosa
#> 5 3.6 1.4 0.2 setosa
#> 6 3.9 1.7 0.4 setosa
# expected message suppressed by user
suppressMessages(my_fun(iris, names(iris)[-1]))
#> Sepal.Width Petal.Length Petal.Width Species
#> 1 3.5 1.4 0.2 setosa
#> 2 3.0 1.4 0.2 setosa
#> 3 3.2 1.3 0.2 setosa
#> 4 3.1 1.5 0.2 setosa
#> 5 3.6 1.4 0.2 setosa
#> 6 3.9 1.7 0.4 setosa
Created on 2021-06-30 by the reprex package (v1.0.0)
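If you would rather let the caller switch the messages off through an argument instead of wrapping the call in suppressMessages(), one possible sketch is a logical flag. The name my_fun_verbose and the verbose argument are assumptions for illustration, not part of the original answer.

```r
# Hypothetical variant of my_fun with a `verbose` switch.
my_fun_verbose <- function(y, stdNames, verbose = TRUE) {
  if (verbose && length(setdiff(names(y), stdNames)) > 0) {
    message("data includes non-standard columns: selecting the standard columns...")
  }
  head(y[, stdNames])
}

res_loud  <- my_fun_verbose(iris, names(iris)[-1])                   # emits the message
res_quiet <- my_fun_verbose(iris, names(iris)[-1], verbose = FALSE)  # stays silent
```

Because the text still goes through message(), suppressMessages() keeps working as well, so both mechanisms can coexist.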
I am new to the tidyverse. I want to join all columns but one (as the names of the other columns might vary). Here is an example with iris that, obviously, does not work. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
Using base R:
dat <- iris
cols <- grepl("Sepal.Length", names(dat), fixed = TRUE) # fixed = TRUE so "." is not treated as a regex wildcard
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
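A vectorised base R variant, sketched here as an alternative to apply(), passes the remaining columns to paste() through do.call(). The variable name keep is only illustrative.

```r
dat <- iris

# Logical index of the columns to paste together (everything except Sepal.Length).
keep <- names(dat) != "Sepal.Length"

# do.call() hands each kept column to paste() as a separate argument.
dat$new <- do.call(paste, c(dat[keep], sep = "_"))

head(dat$new, 3)
#> [1] "3.5_1.4_0.2_setosa" "3_1.4_0.2_setosa"   "3.2_1.3_0.2_setosa"
```

Note that this converts each column on its own, so 3.0 is pasted as "3", whereas apply() first formats the whole data frame as a character matrix and yields "3.0".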
We can use reduce:
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows
library(tidyverse)
iris %>% as_tibble() %>% select(everything())
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
Say I want to select everything in the iris data frame except Species. How do I list this one exception while utilizing tidyselect::everything()?
My actual pipe is below:
... %>%
group_by(`ID`) %>%
fill(everything(), .direction = "updown") %>%
... %>%
When I run it, I get the following error:
Error: Column ID can't be modified because it's a grouping variable
You would do
iris %>% as_tibble() %>% select(-Species)
but assuming you have good reason not to want that, here's a way using everything()
iris %>% as_tibble() %>% select(setdiff(everything(), one_of("Species")))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 1.4 0.3
#> 8 5 3.4 1.5 0.2
#> 9 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
(or just iris %>% as_tibble() %>% select(setdiff(everything(), 5)) if it's acceptable)
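For the grouped fill() pipe in the question, a minimal sketch is to exclude the grouping column with a negative selection. The toy data frame below is invented for illustration, and the example assumes a tidyr version that supports .direction = "updown" (1.0.0 or later).

```r
library(dplyr)
library(tidyr)

# Invented toy data: two IDs with gaps in both value columns.
df <- tibble(
  ID = c(1, 1, 1, 2, 2),
  x  = c(NA, 5, NA, NA, 7),
  y  = c("a", NA, NA, NA, "b")
)

filled <- df %>%
  group_by(ID) %>%
  fill(-ID, .direction = "updown") %>%  # every column except the grouping column
  ungroup()

filled
```

Within each ID the gaps are filled downwards first and then upwards, so no NA remains in x or y here.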
I would like to create a function that takes a data frame and a character vector containing column names as input, and uses tidyverse quoting functions inside in a safe manner.
I believe I have a working example of what I want to do. I would like to know if there is a more elegant solution or whether I am thinking about this problem incorrectly (perhaps I shouldn't want to do this?). From what I can tell, in order to avoid variable scoping issues I need to wrap the column names in .data[[]] and make it an expression before unquoting for tidyverse NSE verbs.
Regarding previous questions, this answer is along the right lines, but I want to abstract the code into a function. A GitHub issue asks about this, but using rlang::syms won't work as far as I can tell, because the combination of the column labels with .data makes it an expression, not a symbol. Here and here solve the problem but, as far as I can tell, either don't account for a subtle bug in which variables can leak in from the environment if they don't exist as column labels in the data frame, or don't work when the input is a vector of labels.
# Setup
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("rlang")))
# define iris with and without Sepal.Width column
iris <- tibble::as_tibble(iris)
df_with_missing <- iris %>% select(-Sepal.Width)
# This should not be findable by my function
Sepal.Width <- iris$Sepal.Width * -1
################
# Now let's try a function for which we programmatically define the column labels
programmatic_mutate_y <- function(df, col_names, safe = FALSE) {
# Add .data[[]] to the col_names to make evaluation safer
col_exprs <- rlang::parse_exprs(
purrr::map_chr(
col_names,
~ glue::glue(stringr::str_c('.data[["{.x}"]]'))
)
)
output <- dplyr::mutate(df, product = purrr::pmap_dbl(list(!!!col_exprs), ~ prod(...)))
output
}
################
# The desired output
testthat::expect_error(programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length")))
programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8
#> 2 4.9 3 1.4 0.2 setosa 14.7
#> 3 4.7 3.2 1.3 0.2 setosa 15.0
#> 4 4.6 3.1 1.5 0.2 setosa 14.3
#> 5 5 3.6 1.4 0.2 setosa 18
#> 6 5.4 3.9 1.7 0.4 setosa 21.1
#> 7 4.6 3.4 1.4 0.3 setosa 15.6
#> 8 5 3.4 1.5 0.2 setosa 17
#> 9 4.4 2.9 1.4 0.2 setosa 12.8
#> 10 4.9 3.1 1.5 0.1 setosa 15.2
#> # … with 140 more rows
Created on 2019-08-09 by the reprex package (v0.3.0)
I think you are overcomplicating things. With the _at variants, you can pass strings as arguments to almost every dplyr function. purrr::pmap_dbl() is used to apply the calculation row by row.
programmatic_mutate_y_v1 <- function(df, col_names, safe = FALSE) {
df["product"] <- purrr::pmap_dbl(dplyr::select_at(df,col_names),prod)
return(df)
}
programmatic_mutate_y_v1(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows
We can turn col_names into a single expression with parse_expr and paste, then unquote when being evaluated in mutate:
library(dplyr)
library(rlang)
programmatic_mutate_y <- function(df, col_names){
mutate(df, product = !!parse_expr(paste(col_names, collapse = "*")))
}
Output:
> parse_expr(paste(c("Sepal.Width", "Sepal.Length"), collapse = "*"))
Sepal.Width * Sepal.Length
> programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length"))
> Error: object 'Sepal.Width' not found
> programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows
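One caveat with pasting bare names into an expression: if a vector called Sepal.Width exists in the calling environment (as in the question's setup), the parsed expression can pick it up when the column is missing. A possible sketch that keeps the lookup inside the data frame builds the product from .data[[ ]] calls instead. The function name product_of_cols is made up for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: multiplies the named columns, looking them up only in `df`.
product_of_cols <- function(df, col_names) {
  # One `.data[["name"]]` expression per requested column.
  cols <- lapply(col_names, function(nm) expr(.data[[!!nm]]))
  # Fold them into a single expression: .data[["a"]] * .data[["b"]] * ...
  prod_expr <- Reduce(function(a, b) expr(!!a * !!b), cols)
  mutate(df, product = !!prod_expr)
}

df_with_missing <- select(iris, -Sepal.Width)
Sepal.Width <- iris$Sepal.Width * -1  # decoy that must not be found

out <- product_of_cols(iris, c("Sepal.Width", "Sepal.Length"))

# With the column missing, the call fails instead of silently using the decoy.
missing_result <- tryCatch(
  product_of_cols(df_with_missing, c("Sepal.Width", "Sepal.Length")),
  error = function(e) "error"
)
```

The .data pronoun only resolves columns of the data frame, which is what prevents the environment leak described in the question.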
Using dplyr, you can do something like this:
iris %>% head %>% mutate(sum=Sepal.Length + Sepal.Width)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
But above, I referenced the columns by their names. How can I use the column indices 1 and 2 to achieve the same result?
Here I have the following, but I feel it's not as elegant.
iris %>% head %>% mutate(sum=apply(select(.,1,2),1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
You can try:
iris %>% head %>% mutate(sum = .[[1]] + .[[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
I'm a bit late to the game, but my personal strategy in cases like this is to write my own tidyverse-compliant function that will do exactly what I want. By tidyverse-compliant, I mean that the first argument of the function is a data frame and that the output is a vector that can be added to the data frame.
sum_cols <- function(x, col1, col2){
x[[col1]] + x[[col2]]
}
iris %>%
head %>%
mutate(sum = sum_cols(x = ., col1 = 1, col2 = 2))
An alternative to reusing . in mutate that will respect grouping is to use dplyr::cur_data_all(). From help(cur_data_all):
cur_data_all() gives the current data for the current group (including grouping variables)
Consider the following:
iris %>% group_by(Species) %>% mutate(sum = .[[1]] + .[[2]]) %>% head
#Error: Problem with `mutate()` column `sum`.
#ℹ `sum = .[[1]] + .[[2]]`.
#ℹ `sum` must be size 50 or 1, not 150.
#ℹ The error occurred in group 1: Species = setosa.
If instead you use cur_data_all(), it works without issue:
iris %>% mutate(sum = select(cur_data_all(),1) + select(cur_data_all(),2)) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
The same approach works with the extract operator ([[).
iris %>% mutate(sum = cur_data()[[1]] + cur_data()[[2]]) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
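In dplyr 1.1.0 and later, cur_data() and cur_data_all() are deprecated in favour of pick(). A minimal sketch of the grouped case with pick(), assuming dplyr >= 1.1.0 is installed:

```r
library(dplyr)

grouped_sum <- iris %>%
  group_by(Species) %>%
  # pick(everything()) returns the current group's non-grouping columns,
  # so positions 1 and 2 are Sepal.Length and Sepal.Width here.
  mutate(sum = pick(everything())[[1]] + pick(everything())[[2]]) %>%
  ungroup()

head(grouped_sum)
```

Because pick() leaves out the grouping columns, the positions count only the remaining columns; that happens to match the original positions here because Species is the last column of iris.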
What do you think about this version?
Inspired by @SavedByJesus's answer.
applySum <- function(df, ...) {
assertthat::assert_that(...length() > 0, msg = "one or more column indexes are required")
mutate(df, Sum = apply(as.data.frame(df[, c(...)]), 1, sum))
}
iris %>%
head(2) %>%
applySum(1, 2)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
#
### you can select and sum more than two columns with the same function
#
iris %>%
head(2) %>%
applySum(1, 2, 3, 4)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 10.2
2 4.9 3.0 1.4 0.2 setosa 9.5
To address the issue that @pluke is asking about in the comments: dplyr doesn't really support referring to columns by index.
Not a perfect solution, but you can use base R to get around this
iris[1] <- iris[1] + iris[2]
This can now (packageVersion("dplyr") >= 1.0.0) be done very nicely with the combination of dplyr::rowwise() and dplyr::c_across().
library(dplyr)
packageVersion("dplyr")
#> [1] '1.0.10'
iris %>%
head %>%
rowwise() %>%
mutate(sum = sum(c_across(c(1, 2))))
#> # A tibble: 6 × 6
#> # Rowwise:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 8.6
#> 2 4.9 3 1.4 0.2 setosa 7.9
#> 3 4.7 3.2 1.3 0.2 setosa 7.9
#> 4 4.6 3.1 1.5 0.2 setosa 7.7
#> 5 5 3.6 1.4 0.2 setosa 8.6
#> 6 5.4 3.9 1.7 0.4 setosa 9.3
Created on 2022-11-01 with reprex v2.0.2
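When the row-wise operation is a plain sum, a vectorised sketch with base R's rowSums() avoids rowwise() altogether, which may matter on larger data. The helper name sum_by_index is invented for illustration.

```r
library(dplyr)

# Hypothetical helper: `idx` is a vector of column positions to add up.
sum_by_index <- function(df, idx) {
  # rowSums() works on the selected columns at once; unname() drops the row names.
  mutate(df, sum = unname(rowSums(df[idx])))
}

out <- iris %>%
  head() %>%
  sum_by_index(c(1, 2))

out$sum
```

This assumes an ungrouped data frame with numeric columns at the given positions; rowSums() stops with an error if a selected column is not numeric.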