Dplyr write a function with column names as inputs - r

I'm writing a function that I'm going to use on multiple columns in dplyr, but I'm having trouble passing column names as inputs to functions for dplyr.
Here's an example of what I want to do:
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(colname)))%>%
select(output)
}
example("var1")
Output should look like
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(var1)))%>%
select(output)
I've found a few similar questions, but nothing that I could directly apply to my problem, so any help is appreciated. I've tried some solutions involving eval, but I honestly don't know what exactly I'm supposed to be passing to eval.

Is this what you expected?
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(colname)))%>%
select(output)
}
example( quote(var1) )
#-----
Source: local data frame [2 x 1]
output
1 7.185935
2 8.090866

The accepted answer does not work anymore in R 3.6 / dplyr 0.8.
As suggested in another answer, one can use !!as.name()
This works for me:
df<-tbl_df(data.frame(group=rep(c("A", "B"), each=3), var1=sample(1:100, 6), var2=sample(1:100, 6)))
example<-function(colname){
df %>%
group_by(group)%>%
summarize(output=mean(sqrt(!!as.name(colname)))%>%
select(output)
}
example( quote(var1) )
If one additionally wants to have the column names to assign to in a mutate, then the easiest is to use the assignment :=. For example to replace colname with its square root.
example_mutate<-function(colname){
df %>%
mutate(!!colname := sqrt(!!as.name(colname)))
}
example_mutate( quote(var1) )
quote() can of course be replaced with quotation marks "".

Related

most elegant way to calculate rowSums of colums that start AND end with certain strings, using dplyr

I am working with a dataset of which I want to calculate rowSums of columns that start with a certain string and end with an other specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- t(values)
colnames(df) <- names
> df
c_first_f c_second_o t_third_f c_fourth_f
[1,] 5 3 2 5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))

Modify multiple variable names [duplicate]

I want to add a suffix or prefix to most variable names in a data.frame, typically after they've all been transformed in some way and before performing a join. I don't have a way to do this without breaking up my piping.
For example, with this data:
library(dplyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
I want to get to this result (note variable names):
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
My current approach is:
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.)))
names(means14)[2:length(names(means14))] <- paste0(names(means14)[2:length(names(means14))], "_mean_2014")
Is there an alternative to that clunky last line that breaks up my pipes? I've looked at select() and rename() but don't want to explicitly specify each variable name, as I usually want to rename all except a single variable and might have a much wider data.frame than in this example.
I'm imagining a final piped command that approximates this made-up function:
appendname(cols = 2:n, str = "_mean_2014", placement = "suffix")
Which doesn't exist as far as I know.
You can pass functions to rename_at, so do
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_all(funs(mean(.))) %>%
rename_at(vars(-class),function(x) paste0(x,"_2014"))
After additional experimenting since posting this question, I've found that the setNames function will work with the piping as it returns a data.frame:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
setNames(c(names(.)[1], paste0(names(.)[-1],"_mean_2014")))
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
This is a bit quicker, but not totally what you want:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) -> means14
names(means14)[-1] %<>% paste0("_mean_2014")
if you haven't used the %<>%-operator before definitely check this link out, its a super-useful tool.
you can also use it for recomputing or rounding some columns, like this df$meancolumn %<>% round() , and so on, it just comes up very often and just saves you a lot of writing
As of February 2017 you can do this with the dplyr command rename_(...).
In the case of this example you could do.
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
rename_(names(.)[-1], paste0(names(.)[-1],"_mean_2014")))
This is rather similar to the answer with set_names but works with tibbles too!
This is more of a step back, but you might think of reshaping your data in order to apply the function to multiple years at the same time. This will preserve tidyness. If you're going to want to end up comparing different years, it might make sense to have the year be a separate variable in a dataframe, rather than storing the year in the names. You should be able to use summarise_ to get the mean_year behavior. See http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
library(dplyr)
library(tidyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
dat14 %>%
gather(variable, value, -ID, -class) %>%
mutate(year = 2014) %>%
group_by(class, year, variable)%>%
summarise(mean = mean(value))`
While Sam Firkes solution using setNames() ist certainly the only solution keeping an unbroken pipe, it will not work with the tbl objects from dplyr, since the column names are not accessible by methods from the usual base R naming functions. Here is a function that you can use within a pipe with tbl objects as well, thanks to this solution by hrbrmstr. It adds predefined prefixes and suffixes at the specified column indices. Default is all columns.
tbl.renamer <- function(tbl,prefix="x",suffix=NULL,index=seq_along(tbl_vars(tbl))){
newnames <- tbl_vars(tbl) # Get old variable names
names(newnames) <- newnames
names(newnames)[index] <- paste0(prefix,".",newnames,suffix)[index] # create a named vector for .dots
rename_(tbl,.dots=newnames) # rename the variables
}
Example usage (Assume auth_users beeing an tbl_sql object):
auth_user %>% tbl_vars
tbl.renamer(auth_user) %>% tbl_vars
auth_user %>% tbl.renamer %>% tbl_vars
auth_user %>% tbl.renamer(index = c(1,5)) %>% tbl_vars

Simple mutate with dplyr gives "wrong result size" error

My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list))
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
The issue is that operators and functions are applied in a vectorized way by mutate. Thus, which is applied to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop).
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))

dplyr change many data types

I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
is there a more elegant (shorter) way in dplyr?
thx
Christof
Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective pendants to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# subsitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an Answer. If you want to change all factor columns to character use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Also other functions are allowed. I for instance used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})
You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)
EDIT - The syntax of this answer has been deprecated, loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of the ?mutate_each (at least in dplyr 0.5) it looks like that function, as in #docendo discimus's answer, will be deprecated and replaced with more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what #hadley mentions in his comment is probably using mutate_at. Note the order of the arguments is reversed, compared to mutate_each, and vars() uses select() like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page, and looking at the alternatives, I ended up using mutate_at but with grep to capture many different kinds of column names at once (unless you always have such obvious column names!)
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() in among the select_helpers, which handles regex, so now I like this.
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if you have all your date columns with matchable names, and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))
Dplyr across function has superseded _if, _at, and _all. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
It's a one-liner with mutate_at:
dat %>% mutate_at("l1", factor) %>% mutate_at("l2", as.numeric)
A more general way of achieving column type transformation is as follows:
If you want to transform all your factor columns to character columns, e.g., this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))
Or mayby even more simple with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combines with tidyselect:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))
For future readers, if you are ok with dplyr guessing the column types, you can convert the col types of an entire df as if you were originally reading it in with readr and col_guess() with
library(tidyverse)
df %>% type_convert()
Try this
df[,1:11] <- sapply(df[,1:11], as.character)

Filter data frame by character column name (in dplyr)

I have a data frame and want to filter it in one of two ways, by either column "this" or column "that". I would like to be able to refer to the column name as a variable. How (in dplyr, if that makes a difference) do I refer to a column name by a variable?
library(dplyr)
df <- data.frame(this = c(1, 2, 2), that = c(1, 1, 2))
df
# this that
# 1 1 1
# 2 2 1
# 3 2 2
df %>% filter(this == 1)
# this that
# 1 1 1
But say I want to use the variable column to hold either "this" or "that", and filter on whatever the value of column is. Both as.symbol and get work in other contexts, but not this:
column <- "this"
df %>% filter(as.symbol(column) == 1)
# [1] this that
# <0 rows> (or 0-length row.names)
df %>% filter(get(column) == 1)
# Error in get("this") : object 'this' not found
How can I turn the value of column into a column name?
Using rlang's injection paradigm
From the current dplyr documentation (emphasis by me):
dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with dplyr. However, dplyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.
So, essentially we need to perform two steps to be able to refer to the value "this" of the variable column inside dplyr::filter():
We need to turn the variable column which is of type character into type symbol.
Using base R this can be achieved by the function as.symbol()
which is an alias for as.name(). The former is preferred by the
tidyverse developers because it
follows a more modern terminology (R types instead of S modes).
Alternatively, the same can be achieved by rlang::sym() from the tidyverse.
We need to inject the symbol from 1) into the dplyr::filter() expression.
This is done by the so called injection operator !! which is basically syntactic
sugar allowing to modify a piece of code before R evaluates it.
(In earlier versions of dplyr (or the underlying rlang respectively) there used to be situations (incl. yours) where !! would collide with the single !, but this is not an issue anymore since !! gained the right operator precedence.)
Applied to your example:
library(dplyr)
df <- data.frame(this = c(1, 2, 2),
that = c(1, 1, 2))
column <- "this"
df %>% filter(!!as.symbol(column) == 1)
# this that
# 1 1 1
Using alternative solutions
Other ways to refer to the value "this" of the variable column inside dplyr::filter() that don't rely on rlang's injection paradigm include:
Via the tidyselection paradigm, i.e. dplyr::if_any()/dplyr::if_all() with tidyselect::all_of()
df %>% filter(if_any(.cols = all_of(column),
.fns = ~ .x == 1))
Via rlang's .data pronoun and base R's [[:
df %>% filter(.data[[column]] == 1)
Via magrittr's . argument placeholder and base R's [[:
df %>% filter(.[[column]] == 1)
I would steer clear of using get() all together. It seems like it would be quite dangerous in this situation, especially if you're programming. You could use either an unevaluated call or a pasted character string, but you'll need to use filter_() instead of filter().
df <- data.frame(this = c(1, 2, 2), that = c(1, 1, 2))
column <- "this"
Option 1 - using an unevaluated call:
You can hard-code y as 1, but here I show it as y to illustrate how you can change the expression values easily.
expr <- lazyeval::interp(quote(x == y), x = as.name(column), y = 1)
## or
## expr <- substitute(x == y, list(x = as.name(column), y = 1))
df %>% filter_(expr)
# this that
# 1 1 1
Option 2 - using paste() (and obviously easier):
df %>% filter_(paste(column, "==", 1))
# this that
# 1 1 1
The main thing about these two options is that we need to use filter_() instead of filter(). In fact, from what I've read, if you're programming with dplyr you should always use the *_() functions.
I used this post as a helpful reference: character string as function argument r, and I'm using dplyr version 0.3.0.2.
Here's another solution for the latest dplyr version:
df <- data.frame(this = c(1, 2, 2),
that = c(1, 1, 2))
column <- "this"
df %>% filter(.[[column]] == 1)
# this that
#1 1 1
Regarding Richard's solution, just want to add that if you the column is character. You can add shQuote to filter by character values.
For example, you can use
df %>% filter_(paste(column, "==", shQuote("a")))
If you have multiple filters, you can specify collapse = "&" in paste.
df %>$ filter_(paste(c("column1","column2"), "==", shQuote(c("a","b")), collapse = "&"))
The latest way to do this is to use my.data.frame %>% filter(.data[[myName]] == 1), where myName is an environmental variable that contains the column name.
Or using filter_at
library(dplyr)
df %>%
filter_at(vars(column), any_vars(. == 1))
Like Salim B explained above but with a minor change:
df %>% filter(1 == !!as.name(column))
i.e. just reverse the condition because !! otherwise behaves
like
!!(as.name(column)==1)
You can use the across(all_of()) syntax, it takes a string as argument
column = "this"
df %>% filter(across(all_of(column)) == 1)

Resources