Function with a data frame column name as input in R

What is the best way to nest a function operation on a data frame in another function? I want to write a function which takes a data frame and a column name and then does something on that column and returns the modified data frame like below:
library(dplyr)
func = function(df, col){
df = df %>% mutate(col = col + 1)
return(df)
}
new_df = func(cars, 'speed')
But this raises an error: inside the function, col holds the string 'speed', so col + 1 fails, and I am not sure how to pass the column as a function argument other than as a string. Any idea how to fix this with minimum effort?

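To use dplyr verbs inside a function you have to deal with non-standard evaluation. In this case, wrapping the argument in {{ }} (and using := to name the result) is enough. Note that the column is then passed unquoted, as speed rather than 'speed'.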
library(dplyr)
func = function(df, col) {
df = df %>% mutate({{col}} := {{col}} + 1)
return(df)
}
new_df = func(cars, speed)
head(cars)
# speed dist
#1 4 2
#2 4 10
#3 7 4
#4 7 22
#5 8 16
#6 9 10
head(new_df)
# speed dist
#1 5 2
#2 5 10
#3 8 4
#4 8 22
#5 9 16
#6 10 10
You can read more about non-standard evaluation here https://dplyr.tidyverse.org/articles/programming.html
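If the column name has to stay a string, as in the original call func(cars, 'speed'), a minimal sketch along the following lines may help. It assumes a reasonably recent dplyr and rlang; add_one_chr is a hypothetical helper name, not something from the answers above.

```r
library(dplyr)

# Hypothetical helper: `col` is a column name given as a string, e.g. "speed".
add_one_chr <- function(df, col) {
  # .data[[col]] looks the column up by its string name;
  # !!col := uses that same string as the name of the result column.
  df %>% mutate(!!col := .data[[col]] + 1)
}

new_df <- add_one_chr(cars, "speed")
head(new_df)
```

The .data pronoun keeps the lookup inside the data frame, so a variable called speed in the calling environment would not be picked up by accident.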

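I think you mean that you want the column to be numeric so that you can add 1. If that is correct, see below. Because col is a string here, the column has to be looked up with .data[[col]] and the result named with !!col :=.
library(dplyr)
func = function(df, col){
# col is a column name given as a string, e.g. 'speed'
df = df %>% mutate(!!col := as.numeric(.data[[col]]) + 1)
return(df)
}
new_df = func(cars, 'speed')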
Another alternative would be to use the index of the column as the function argument, instead of the string name of the column.
That might look something like
library(dplyr)
func = function(df, col_index){
# translate the index into a column name
col_name <- colnames(df)[col_index]
df = df %>% mutate(!!col_name := .data[[col_name]] + 1)
return(df)
}
new_df = func(cars, 2) # adds 1 to the second column, dist

Related

R functions: use argument as name within the function

I have created a user function in R to multiply two columns to create a third (within a series), so this function creates 4 new columns.
create_mult_var <- function(.data){
.data <-.data%>%
mutate(Q4_1_4 = Q4_1_2_TEXT*Q4_1_3_TEXT) %>%
mutate(Q4_2_4 = Q4_2_2_TEXT*Q4_2_3_TEXT) %>%
mutate(Q4_3_4 = Q4_3_2_TEXT*Q4_3_3_TEXT) %>%
mutate(Q4_4_4 = Q4_4_2_TEXT*Q4_4_3_TEXT)
.data
}
I am trying to modify this function so that I can apply it to a different set of columns that match the same type. For instance, if I want to repeat this on the series of columns that start with "Q8", I know I can do the following:
create_mult_var_2 <- function(.data){
.data <-.data%>%
mutate(Q8_1_4 = Q8_1_2_TEXT*Q8_1_3_TEXT) %>%
mutate(Q8_2_4 = Q8_2_2_TEXT*Q8_2_3_TEXT) %>%
mutate(Q8_3_4 = Q8_3_2_TEXT*Q8_3_3_TEXT) %>%
mutate(Q8_4_4 = Q8_4_2_TEXT*Q8_4_3_TEXT)
.data
}
Instead of creating a different function for each of the Q4 and Q8 series, I would like to add the "Q4" or "Q8" as an argument. I tried this below, but R would not accept this as an argument this way. Is there a way to achieve my desired outcome?
This does not work:
create_mult_var <- function(.data,question){
.data <-.data%>%
mutate(question_1_4 = question_1_2_TEXT*question_1_3_TEXT) %>%
mutate(question_2_4 = question_2_2_TEXT*question_2_3_TEXT) %>%
mutate(question_3_4 = question_3_2_TEXT*question_3_3_TEXT) %>%
mutate(question_4_4 = question_4_2_TEXT*question_4_3_TEXT)
.data
}
I would like to modify the function, such as that I can use the following:
data_in %>% create_mult_var("Q4") %>% create_mult_var("Q8")
Or something similar to create these new columns? Any suggestions are appreciated! Thank you! If this is a bad idea, any suggestions for how I should approach this?
We could build the column names with str_c (or paste) and evaluate them with !! (this needs dplyr and stringr loaded)
create_mult_var_2 <- function(.data, pat){
.data <-.data%>%
mutate(!! str_c(pat, '_1_4') :=
!! rlang::sym(str_c(pat, '_1_2_TEXT')) *
!! rlang::sym(str_c(pat, '_1_3_TEXT')))
.data
}
create_mult_var_2(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
Also, based on the pattern shown, this can be automated as well
library(dplyr)
library(stringr)
create_mult_var_3 <- function(.data, pat) {
.data %>%
mutate(across(matches(str_c("^", pat, "_\\d+_2")), ~
.* get(str_replace(cur_column(), '_2_TEXT', '_3_TEXT')),
.names = '{.col}_new')) %>%
rename_at(vars(ends_with('_new')),
~ str_replace(., '\\d+_TEXT_new', '4'))
}
Testing:
create_mult_var_3(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
Data:
data_in <- data.frame(Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8)
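A plainer variant is to loop over the item numbers and build the three column names as strings. The sketch below is an illustration under that assumption; create_mult_var_loop and its items argument are hypothetical names, and items whose source columns are missing are skipped.

```r
library(dplyr)

data_in <- data.frame(Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8)

# Hypothetical helper: `pat` is the question prefix ("Q4", "Q8", ...),
# `items` are the item numbers to process.
create_mult_var_loop <- function(df, pat, items = 1:4) {
  for (i in items) {
    new_col <- paste0(pat, "_", i, "_4")
    col_a   <- paste0(pat, "_", i, "_2_TEXT")
    col_b   <- paste0(pat, "_", i, "_3_TEXT")
    # Skip items whose source columns are not present in the data
    if (!all(c(col_a, col_b) %in% names(df))) next
    df <- df %>% mutate(!!new_col := .data[[col_a]] * .data[[col_b]])
  }
  df
}

out <- data_in %>% create_mult_var_loop("Q4")
out
```

Because the data frame is the first argument, calls can be chained in the form the question asks for, e.g. data_in %>% create_mult_var_loop("Q4") %>% create_mult_var_loop("Q8").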

Dplyr _if verbs with predicate function referring to the column names & multiple conditions?

I'm trying to use mutate_if or select_if, etc, verbs with column names within the predicate function.
See example below:
> btest <- data.frame(
+ sjr_first = c('1','2','3',NA, NA, '6'),
+ jcr_first = c('1','2','3',NA, NA, '6'),
+ sjr_second = LETTERS[1:6],
+ jcr_second = LETTERS[1:6],
+ sjr_third = as.character(seq(6)),
+ jcr_fourth = seq(6) + 5,
+ stringsAsFactors = FALSE)
>
> btest %>% select_if(.predicate = ~ str_match(names(.), 'jcr'))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
replacement has length zero
I'm aware I could use btest %>% select_at(vars(dplyr::matches('jcr'))) but my goal here is actually to combine the column name condition with another condition (e.g. is.numeric) using mutate_if() to operate on a subset of my columns. However I'm not sure how to get the first part with the name matching to work...
You can do:
btest %>%
select_if(str_detect(names(.), "jcr") & sapply(., is.numeric))
jcr_fourth
1 6
2 7
3 8
4 9
5 10
6 11
Tidyverse solution:
require(dplyr)
# Return (get):
btest %>%
select_if(grepl("jcr", names(.)) & sapply(., is.numeric))
# Mutate (set):
btest %>%
mutate_if(grepl("jcr", names(.)) & sapply(., is.numeric), ~ paste0("whatever", .))
Base R solution:
# Return (get):
btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric), drop = FALSE]
# Mutate (set):
btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric)] <- paste0("whatever", unlist(btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric)]))
You could separate two select_if calls
library(dplyr)
library(stringr)
btest %>% select_if(str_detect(names(.), 'jcr')) %>% select_if(is.numeric)
# jcr_fourth
#1 6
#2 7
#3 8
#4 9
#5 10
#6 11
We cannot combine the two calls because the first one operates on entire dataframe together whereas the second one operates column-wise.
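With dplyr 1.0 or later (an assumption about the installed version), the name condition and the type condition can also be combined as tidyselect helpers, which avoids the _if verbs altogether. A minimal sketch on a cut-down version of btest:

```r
library(dplyr)

# Cut-down version of the example data from the question
btest <- data.frame(
  jcr_first  = c("1", "2", NA),
  sjr_third  = c("1", "2", "3"),
  jcr_fourth = 1:3 + 5,
  stringsAsFactors = FALSE
)

# Select: columns whose name matches 'jcr' AND that are numeric
selected <- btest %>% select(matches("jcr") & where(is.numeric))

# Mutate: apply a function to the same subset of columns
mutated <- btest %>%
  mutate(across(matches("jcr") & where(is.numeric), ~ .x * 2))
```

Here matches() works on column names and where() on column contents, and & intersects the two selections.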

Renaming columns according to vector inside pipe

I have a data.frame df with columns A and B:
df <- data.frame(A = 1:5, B = 11:15)
There's another data.frame, df2, which I'm building by various calculations that ends up having generic column names X1 and X2, which I cannot control directly (because it passes through being a matrix at one point). So it ends up being something like:
mtrx <- matrix(1:10, ncol = 2)
mtrx %>% data.frame()
I would like to rename the columns in df2 to be the same as df. I could, of course, do it after I finish building df2 with a simple assigning:
names(df2)<-names(df)
My question is - is there a way to do this directly within the pipe? I can't seem to use dplyr::rename, because these have to be in the form of newname=oldname, and I can't seem to vectorize it. Same goes to the data.frame call itself - I can't just give it a vector of column names, as far as I can tell. Is there another option I'm missing? What I'm hoping for is something like
mtrx %>% data.frame() %>% rename(names(df))
but this doesn't work; it fails with "Error: All arguments must be named".
Cheers!
You can use setNames
mtrx %>%
data.frame() %>%
setNames(., nm = names(df))
# A B
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Or use purrr's equivalent set_names
mtrx %>%
data.frame() %>%
purrr::set_names(., nm = names(df))
A third option is "names<-"
mtrx %>%
data.frame() %>%
"names<-"(names(df))
We can use rename_all from dplyr (loaded here via tidyverse)
library(tidyverse)
mtrx %>%
as.data.frame %>%
rename_all(~ names(df))
# A B
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
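In dplyr 1.0 and later, rename_with() plays the same role as rename_all(). A short sketch, assuming that version is available:

```r
library(dplyr)

df   <- data.frame(A = 1:5, B = 11:15)
mtrx <- matrix(1:10, ncol = 2)

# rename_with() applies a function to the current column names;
# here the function ignores them and returns the names of df instead.
df2 <- mtrx %>%
  as.data.frame() %>%
  rename_with(~ names(df))

names(df2)
```

This only works when df has as many columns as the piped data frame, since the function must return one new name per existing column.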

dplyr mutate using character vector of column names

data is a data.frame containing the columns date, a, b, c, d. The last four are numeric.
Y.columns <- c("a")
X.columns <- c("b","c","d")
What I need:
data.mutated <- data %>%
mutate(Y = a, X = b+c+d) %>%
select(date,Y,X)
But I would like to pass the mutate arguments from a character vector.
I tried the following:
Y.string <- paste(Y.columns, collapse='+')
X.string <- paste(X.columns, collapse='+')
data.mutated <- data %>%
mutate(Y = UQ(Y.string), X = UQ(X.string)) %>%
select(date,Y,X)
But it didn't work. Any help is appreciated.
To use tidy evaluation here, you first need to parse the strings into expressions with parse_expr from rlang and then unquote them with !! (using mtcars as an example, since the OP's question is not reproducible). Older rlang versions spelled this UQ(parse_quosure(...)); both of those functions have since been deprecated.
Y.columns <- c("cyl")
X.columns <- c("disp","hp","drat")
# Build the expressions as strings: "cyl" and "disp+hp+drat"
Y.string <- paste(Y.columns, collapse='+')
X.string <- paste(X.columns, collapse='+')
library(dplyr)
library(rlang)
mtcars %>%
mutate(Y = !!parse_expr(Y.string),
X = !!parse_expr(X.string)) %>%
select(Y,X)
Result:
Y X
1 6 273.90
2 6 273.90
3 4 204.85
4 6 371.08
5 8 538.15
6 6 332.76
7 8 608.21
8 4 212.39
9 4 239.72
10 6 294.52
...
Note:
mutate_ is now deprecated, so I think tidy evaluation (quosures and unquoting) is the new way to go.
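When the goal is just a sum over the named columns, the string parsing can be avoided. The following is a sketch assuming dplyr 1.0 or later, again on mtcars; it uses across() with all_of() for the X columns and the .data pronoun for the single Y column.

```r
library(dplyr)

Y.columns <- "cyl"
X.columns <- c("disp", "hp", "drat")

res <- mtcars %>%
  mutate(
    Y = .data[[Y.columns]],                  # single column looked up by name
    X = rowSums(across(all_of(X.columns)))   # row-wise sum of the listed columns
  ) %>%
  select(Y, X)

head(res)
```

This relies on Y.columns holding exactly one name; with several Y columns, the rowSums form used for X would apply there as well.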

Error when combining dplyr inside a function

I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:
group.count <- c(101,99,4)
data <- data.frame(
by = rep(3:1,group.count),
y = rep(letters[1:3],group.count))
data %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
Which gives me the outcome I'm looking for. However, when I try to do it as a function:
res0 <- function(x1,x2) {
output = data %>%
group_by(x2) %>%
summarise(non.miss = sum(!is.na(x1)))
}
res0(y,by)
I get an error (index out of bounds).
Can anybody tell me what I'm missing?
Thanks in advance.
You can't do it like that in dplyr.
The problem is that dplyr does not substitute the value of the argument: group_by(x2) looks for a column literally named x2, which is not a part of your data.frame. Your first thought might be to pass "by" as a string, but this won't work with dplyr either. To show this, make your data.frame as such:
data <- data.frame(
x2 = rep(3:1,group.count),
x1 = rep(letters[1:3],group.count)
)
Then call your function again and it will return the expected output.
I suggest changing the name of your dataframe to df.
This is basically what you have done:
df %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
which produces this:
# by non.miss
#1 1 4
#2 2 99
#3 3 101
but to count the number of observations per group, you could use length, which gives the same answer:
df %>%
group_by(by) %>%
summarise(non.miss = length(y))
# by non.miss
#1 1 4
#2 2 99
#3 3 101
or, use tally, which gives this:
df %>%
group_by(by) %>%
tally
# by n
#1 1 4
#2 2 99
#3 3 101
Now, if you really wanted to, you could put that into a function. The input would be the dataframe, like this:
res0 <- function(df) {
df %>%
group_by(by) %>%
tally
}
res0(df)
# by n
#1 1 4
#2 2 99
#3 3 101
This of course assumes that your dataframe will always have the grouping column named 'by'. I realize that these data are just fictional, but avoiding naming columns 'by' might be a good idea because that is its own function in R - it may get a bit confusing reading the code with it in.
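If the grouping and value columns do need to be arguments, as in the original res0(y, by) call, one option in newer versions is the {{ }} operator. The sketch below assumes rlang 0.4.0 or later with a matching dplyr; count_non_missing is a hypothetical name.

```r
library(dplyr)

group.count <- c(101, 99, 4)
df <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# Hypothetical helper: both columns are passed unquoted and
# forwarded to dplyr with {{ }}.
count_non_missing <- function(df, value_col, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(non.miss = sum(!is.na({{ value_col }})))
}

res <- count_non_missing(df, y, by)
res
```

Passing the data frame as the first argument also keeps the function usable in a pipe, e.g. df %>% count_non_missing(y, by).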
