I want to apply different functions to the same column in a tibble. These functions are stored as strings in a character vector. I used to do this with mutate_ and the .dots argument like this:
library(dplyr)
myfuns <- c(f1 = "a^2", f2 = "exp(a)", f3 = "sqrt(a)")
tibble(a = 1:3) %>%
mutate_(.dots = myfuns)
This approach still works fine but mutate_ is deprecated. I tried to achieve the same result with mutate and the rlang package but did not get very far.
In my real example myfuns contains about 200 functions so typing them one by one is not an option.
Thanks in advance.
For simple equations that take a single input, it’s sufficient to supply the function itself, e.g.
iris %>% mutate_at(vars(-Species), sqrt)
Or, when using an equation rather than a simple function, via a formula:
iris %>% mutate_at(vars(-Species), ~ . ^ 2)
When using equations that access more than a single variable, you need to use rlang quosures instead:
area = quo(Sepal.Length * Sepal.Width)
iris %>% mutate(Sepal.Area = !! area)
Here, quo creates a “quosure” — i.e. a quoted representation of your equation, same as your use of strings, except, unlike strings, this one is properly scoped, is directly usable by dplyr, and is conceptually cleaner: It is like any other R expression, except not yet evaluated. The difference is as follows:
1 + 2 is an expression with value 3.
quo(1 + 2) is an unevaluated expression with value 1 + 2 that evaluates to 3, but it needs to be explicitly evaluated. So how do we evaluate an unevaluated expression?
That is the job of !! (pronounced “bang bang”): it unquotes the previously quoted expression, inserting it into the mutate call, where it is then evaluated. This is important, because Sepal.Length and Sepal.Width are only known inside the mutate call, not outside of it.
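To make the distinction concrete, here is a minimal sketch that evaluates quosures by hand with rlang::eval_tidy(). That function is not used in the answer above and is an addition for illustration only; the names q and area_values are likewise illustrative.

```r
library(rlang)

# A quosure stores the expression itself, not its value
q <- quo(1 + 2)
quo_get_expr(q)   # the unevaluated call 1 + 2
eval_tidy(q)      # explicit evaluation yields 3

# Column names can only be resolved once a data frame is supplied as the data mask
area <- quo(Sepal.Length * Sepal.Width)
area_values <- eval_tidy(area, data = iris)
head(area_values)
```

Evaluating area without the data argument would fail, because Sepal.Length is not defined outside the data frame.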
In all the cases above, the expressions can be inside a list, too. The only difference is that for lists you need to use !!! instead of !!:
funs = list(
Sepal.Area = quo(Sepal.Length * Sepal.Width),
Sepal.Ratio = quo(Sepal.Length / Sepal.Width)
)
iris %>% mutate(!!! funs)
The !!! operation is known as “unquote-splice”. The idea is that it “splices” the list elements of its argument into the parent call. That is, it modifies the call as if it contained the list elements verbatim as arguments (this only works in functions, such as mutate, that support it, though).
Convert your strings to expressions
myexprs <- purrr::map( myfuns, rlang::parse_expr )
then pass those expressions to regular mutate using quasiquotation:
tibble(a = 1:3) %>% mutate( !!!myexprs )
# # A tibble: 3 x 4
# a f1 f2 f3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2.72 1
# 2 2 4 7.39 1.41
# 3 3 9 20.1 1.73
Note that this will also work with strings / expressions involving multiple columns.
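As a small sketch of the multi-column case, assuming a hypothetical second column b and made-up expression strings (lapply is used here instead of purrr::map only to keep the dependencies minimal):

```r
library(dplyr)

# Hypothetical expression strings that refer to two columns
myfuns <- c(total = "a + b", ratio = "a / b")

# lapply() keeps the names, which become the new column names
myexprs <- lapply(myfuns, rlang::parse_expr)

out <- tibble(a = c(2, 4), b = c(1, 2)) %>%
  mutate(!!!myexprs)
out
```

Because the strings are parsed into ordinary expressions, any column of the tibble can appear in them.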
You have only one column, so both approaches below will give you the same result.
You only have to modify your functions' list.
library(dplyr)
myfuns <- c(f1 = ~.^2, f2 = ~exp(.), f3 = ~sqrt(.))
tibble(a = 1:3) %>% mutate_at(vars(a), myfuns)
tibble(a = 1:3) %>% mutate_all(myfuns)
# # A tibble: 3 x 4
# a f1 f2 f3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2.72 1
# 2 2 4 7.39 1.41
# 3 3 9 20.1 1.73
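For readers on dplyr 1.0.0 or later, where the scoped verbs are superseded, a roughly equivalent sketch uses across(). The .names = "{.fn}" specification is an assumption made here so that the new columns are called f1, f2 and f3 rather than a_f1, a_f2 and a_f3.

```r
library(dplyr)

# The same functions, written as purrr-style lambdas in a named list
myfuns <- list(f1 = ~ .x^2, f2 = ~ exp(.x), f3 = ~ sqrt(.x))

out <- tibble(a = 1:3) %>%
  # apply every function to column a; name each result after its function
  mutate(across(a, myfuns, .names = "{.fn}"))
out
```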
A base R alternative:
myfuns <- c(f1 = "a^2", f2 = "exp(a)", f3 = "sqrt(a)")
df <- data.frame(a = 1:3)
df[names(myfuns)] <- lapply(myfuns , function(x) eval(parse(text= x), envir = df))
df
#> a f1 f2 f3
#> 1 1 1 2.718282 1.000000
#> 2 2 4 7.389056 1.414214
#> 3 3 9 20.085537 1.732051
Created on 2019-07-08 by the reprex package (v0.3.0)
One way using parse_expr from rlang
library(tidyverse)
library(rlang)
tibble(a = 1:3) %>%
mutate(ans = map(myfuns, ~eval(parse_expr(.)))) %>%
#OR mutate(ans = map(myfuns, ~eval(parse(text = .)))) %>%
unnest() %>%
group_by(a) %>%
mutate(temp = row_number()) %>%
spread(a, ans) %>%
select(-temp) %>%
rename_all(~names(myfuns))
# A tibble: 3 x 3
# f1 f2 f3
# <dbl> <dbl> <dbl>
#1 1 2.72 1
#2 4 7.39 1.41
#3 9 20.1 1.73
You can also try a purrr approach:
# define the functions
f1 <- function(a) a^2
f2 <- function(a, b) a + b
f3 <- function(b) sqrt(b)
# put all functions in one list
tibble(funs=list(f1, f2, f3)) %>%
# give each function a name
mutate(fun_id=paste0("f", row_number())) %>%
# add to each row/function the matching column profile
# first extract the column names you specified in each function
#mutate(columns=funs %>%
# toString() %>%
# str_extract_all(., "function \\(.*?\\)", simplify = T) %>%
# str_extract_all(., "(?<=\\().+?(?=\\))", simplify = T) %>%
# gsub(" ", "", .) %>%
# str_split(., ",")) %>%
# with the help of Konrad we can use fn_fmls_names
mutate(columns=map(funs, ~ rlang::fn_fmls_names(.))) %>%
# select the columns and add to our tibble/data.frame
mutate(params=map(columns, ~select(df, .))) %>%
# invoke the functions
mutate(results = invoke_map(.f = funs, .x = params)) %>%
# transform to desired output
unnest(results) %>%
group_by(fun_id) %>%
mutate(n=row_number()) %>%
spread(fun_id, results) %>%
left_join(mutate(df, n=row_number()), .) %>%
select(-n)
Joining, by = "n"
# A tibble: 5 x 5
a b f1 f2 f3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 4 3 1
2 4 1 16 5 1
3 5 2 25 7 1.41
4 7 2 49 9 1.41
5 8 2 64 10 1.41
Some data (define df before running the pipeline above):
df <- data_frame(
a = c(2, 4, 5, 7, 8),
b = c(1, 1, 2, 2, 2))
Related
I want to calculate the geometric mean of two vectors in a tibble using the tidyverse. The mean should be calculated row-wise for the two variables. I wrote the function below to this end, and it works, but I am wondering how this operation could be written more efficiently; by more efficient I mean less code, faster and neater. Any better ideas? Just thinking out loud: can map_*() be used in this case? I am also aware of rowwise(), but as far as I know Hadley Wickham, the author of the tidyverse, has lately downplayed the use of rowwise().
A minimal representative example is below:
Reprex
df <- tribble(
~v1, ~ v2,
4, 5,
NA, 7,
2, 2,
3, NA,
NA, NA,
9, 9)
Suggested function
gMean <- function (df, v1, v2){
output <- vector ("double", nrow (df))
for (i in 1:nrow(df)){
output[[i]] <- case_when (!is.na(df$v1[i]) && !is.na(df$v2[i]) ~ ((df$v1[i] * df$v2[i]) ^ 0.5),
is.na (df$v1[i]) && is.na (df$v2[i]) ~ 1,
!is.na(df$v1[i]) && is.na(df$v2[i]) ~ df$v1[i],
is.na(df$v1[i]) && !is.na(df$v2[i]) ~ df$v2[i]
)
}
output
}
output
df %>%
gMean (v1, v2)
[1] 4.472136 7.000000 2.000000 3.000000 1.000000 9.000000
You could also (just) use mutate instead of looping over each row.
In your case, there is no need to map or to use rowwise, and because case_when evaluates its conditions in order from top to bottom (the first match wins), you can simplify your is.na calls as well.
df |> mutate(gMean = case_when(is.na(v1) & is.na(v2) ~ 1,
is.na(v1) ~ v2,
is.na(v2) ~ v1,
TRUE ~ sqrt(v1 * v2)))
However, if we want to use rowwise() or map2_dbl(), we could use prod() for its na.rm option and take the square root (^(1/2)) only where both values are available. This relies on 1/0 being defined as Inf in R and 1^Inf as 1, so a row with two missing values yields 1.
df |>
rowwise() |>
mutate(gMean = prod(v1, v2, na.rm = TRUE) ^ (1 / sum(c(!is.na(v1), !is.na(v2))))) |>
ungroup()
df |>
mutate(gMean = map2_dbl(v1, v2, ~ prod(.x, .y, na.rm = TRUE) ^ (1 / sum(c(!is.na(.x), !is.na(.y))))))
Output:
# A tibble: 6 × 3
v1 v2 gMean
<dbl> <dbl> <dbl>
1 4 5 4.47
2 NA 7 7
3 2 2 2
4 3 NA 3
5 NA NA 1
6 9 9 9
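The arithmetic behind the rows with missing values in the prod() version can be checked step by step. This is only a sketch of the reasoning; the variable names are illustrative.

```r
# Both values missing: prod() with na.rm = TRUE has nothing left to multiply
empty_product <- prod(NA_real_, NA_real_, na.rm = TRUE)
n_present <- sum(c(!is.na(NA_real_), !is.na(NA_real_)))
exponent <- 1 / n_present            # 1 / 0 is Inf in R
both_missing <- empty_product ^ exponent

# One value missing: the exponent is 1 / 1, so the present value is returned
one_missing <- prod(NA_real_, 7, na.rm = TRUE) ^ (1 / 1)

c(empty_product, exponent, both_missing, one_missing)
```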
Another possible solution:
library(tidyverse)
df %>%
mutate(gMean = map2_dbl(v1, v2, ~ sqrt(.x * .y)) %>%
coalesce(v1, v2) %>% if_else(is.na(.), 1, .))
#> # A tibble: 6 × 3
#> v1 v2 gMean
#> <dbl> <dbl> <dbl>
#> 1 4 5 4.47
#> 2 NA 7 7
#> 3 2 2 2
#> 4 3 NA 3
#> 5 NA NA 1
#> 6 9 9 9
I have a data.frame (or tibble or whatever) with an id variable. I often perform some operation per id with dplyr::group_by, like so:
data %>%
group_by(id) %>%
summarise/mutate/...()
Often, I have other non-numeric variables that are unique for each id, such as the project or country to which the id belongs and other characteristics of the id (such as gender, etc.). When I use the summarise function above, these other variables are lost unless I specify either
data %>%
group_by(id) %>%
summarise(across(c(project, country, gender, ...), unique),...)
or
data %>%
group_by(id, project, country, gender, ...) %>%
summarise()
Is there a function that detects the variables that are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by-related functions, but other environments like base R or data.table are welcome too.
I did not test it extensively yet it should do the job
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
R = rnorm(6))
myData
#> # A tibble: 6 x 3
#> X Y R
#> <dbl> <chr> <dbl>
#> 1 1 A 0.463
#> 2 1 A -0.965
#> 3 2 B -0.403
#> 4 2 B -0.417
#> 5 2 B -2.28
#> 6 3 C 0.423
group_by_id_vars <- function(.data, ...) {
# group by the prespecified ID variables
.data <- .data %>% group_by(...)
# how many groups do these ID determine
ID_groups <- .data %>% n_groups()
# Get the number of groups if the initial grouping variables are combined
# with other variables
groupVars <- sapply(substitute(list(...))[-1], deparse) #specified grouping Variable
nms <- names(.data) # all variables in .data
res <- sapply(nms[!nms %in% groupVars],
function(x) {
.data %>%
# important to specify .add = TRUE to combine the variable
# with the IDs
group_by(across(all_of(x)), .add = TRUE) %>%
n_groups()})
# which combinations are identical, i.e. this variable does not increase the
# number of groups in the data if combined with IDvars
v <- names(res)[which(res == ID_groups)]
# group the data accordingly
.data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
return(.data)
}
myData %>%
group_by_id_vars(X) %>%
summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups: X [3]
#> X Y n
#> <dbl> <chr> <int>
#> 1 1 A 2
#> 2 2 B 3
#> 3 3 C 1
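A more compact way to run the same check, sketched under the assumption of dplyr 1.0.0 or later, is to count the distinct values of every other column within each id. The helper name constant_within() is hypothetical, and the R values below are fixed stand-ins for the random numbers above.

```r
library(dplyr)

# Hypothetical helper: names of the columns that take exactly one value per id
constant_within <- function(.data, id) {
  counts <- .data %>%
    group_by({{ id }}) %>%
    # number of distinct values of every non-grouping column, per group
    summarise(across(everything(), n_distinct), .groups = "drop") %>%
    select(-{{ id }})
  is_constant <- vapply(counts, function(x) all(x == 1L), logical(1),
                        USE.NAMES = FALSE)
  names(counts)[is_constant]
}

myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
                 Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
                 R = c(0.46, -0.97, -0.40, -0.42, -2.28, 0.42))

id_vars <- constant_within(myData, X)
id_vars
```

The returned names could then be passed to group_by(across(all_of(...))) together with the id column.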
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert these to factors and then use some linear algebra.
You can use findLinearCombos() from caret to locate these. It takes a bit of work to get it all organized how I think you want it though.
Something like this may do the trick. I also have not extensively tested this.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
# capture the ... and convert to a character vector
.groups <- rlang::ensyms(...)
.groups_chr <- map_chr(.groups, rlang::as_name)
# convert all character and factor variables to a numeric
d <- .data %>%
mutate(across(where(is.factor), as.character),
across(where(is.character), as.factor),
across(where(is.factor), as.integer))
# find linear combinations of the character / factor variables
lc <- caret::findLinearCombos(d)
# see if any of your grouping variables have linear combinations
find_group_match <- function(known_groups, lc_pair) {
if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else NULL
}
# convert column indices to names
lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
# iteratively look for linear combinations of known grouping variables
lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
# find new grouping variables
added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
# apply the grouping to your groups and the linear combinations
group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W), Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this all in other ways, the key part is building that added_groups list to find them.
# A tibble: 10 x 5
# Groups: W, X, V [10]
V W X Y Z
<chr> <chr> <chr> <chr> <dbl>
1 A a Aa A 0.884
2 B b Bb A 0.133
3 C c Cc B 0.194
4 D d Dd B 0.407
5 E e Ee C 0.256
6 F f Ff C 0.0976
7 G g Gg D 0.635
8 H h Hh D 0.0542
9 I i Ii E 0.0104
10 J j Jj E 0.464
My question is similar to this question but I need to apply a more complex function across columns and I can't figure out how to apply Lionel's suggested solution to a custom function with a scoped verb like filter_at() or a filter()+across() equivalent. It doesn't look like a "superstache"/{{{}}} operator has been introduced.
Here is a non-programmed example of what I want to do (doesn't use NSE):
library(dplyr)
library(magrittr)
foo <- tibble(group = c(1,1,2,2,3,3),
a = c(1,1,0,1,2,2),
b = c(1,1,2,2,0,1))
foo %>%
group_by(group) %>%
filter_at(vars(a,b), any_vars(n_distinct(.) != 1)) %>%
ungroup
#> # A tibble: 4 x 3
#> group a b
#> <dbl> <dbl> <dbl>
#> 1 2 0 2
#> 2 2 1 2
#> 3 3 2 0
#> 4 3 2 1
I haven't found an equivalent of this filter_at line with filter+across() yet, but since the new(ish) tidyeval functions predate dplyr 1.0 I assume that issue can be set aside. Here is my attempt to make a programmed version where the filtering variables are user-supplied with dots:
my_function <- function(data, ..., by) {
dots <- enquos(..., .named = TRUE)
helperfunc <- function(arg) {
return(any_vars(n_distinct(arg) != length(arg)))
}
dots <- lapply(dots, function(dot) call("helperfunc", dot))
data %>%
group_by({{ by }}) %>%
filter(!!!dots) %>%
ungroup
}
foo %>%
my_function(a, b, group)
#> Error: Problem with `filter()` input `..1`.
#> x Input `..1` is named.
#> i This usually means that you've used `=` instead of `==`.
#> i Did you mean `a == helperfunc(a)`?
I'd love if there were a way to just plug in an NSE operator inside the vars() argument in filter_at and not have to make all these extra calls (I assume this is what a {{{}}} function would do?)
Maybe I'm misunderstanding what the issue is, but the standard pattern of forwarding the dots seems to work fine here:
my_function <- function(data, ..., by) {
data %>%
group_by({{ by }}) %>%
filter_at(vars(...), any_vars(n_distinct(.) != 1)) %>%
ungroup
}
foo %>%
my_function( a, b, by=group ) # works
Here is a way to use across() to achieve this that is covered in vignette("colwise").
my_function <- function(data, vars, by) {
data %>%
group_by({{ by }}) %>%
filter(n_distinct(across({{ vars }}, ~ .x)) != 1) %>%
ungroup()
}
foo %>%
my_function(c(a, b), by = group)
# A tibble: 4 x 3
group a b
<dbl> <dbl> <dbl>
1 2 0 2
2 2 1 2
3 3 2 0
4 3 2 1
An option with across
my_function <- function(data, by, ...) {
dots <- enquos(..., .named = TRUE)
nm1 <- purrr::map_chr(dots, rlang::as_label)
data %>%
dplyr::group_by({{ by }}) %>%
dplyr::mutate(across(nm1, ~ n_distinct(.) !=1, .names = "{col}_ind")) %>%
dplyr::ungroup() %>%
dplyr::filter(dplyr::select(., ends_with('ind')) %>% purrr::reduce(`|`)) %>%
dplyr::select(-ends_with('ind'))
}
my_function(foo, group, a, b)
# A tibble: 4 x 3
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
Or with filter/across
foo %>%
group_by(group) %>%
filter(any(!across(c(a,b), ~ n_distinct(.) == 1)))
# A tibble: 4 x 3
# Groups: group [2]
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
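On dplyr 1.0.4 or later, if_any() reads as the closest counterpart of filter_at() combined with any_vars(). The following is a sketch under that version assumption, reusing the foo data from the question:

```r
library(dplyr)

foo <- tibble(group = c(1, 1, 2, 2, 3, 3),
              a = c(1, 1, 0, 1, 2, 2),
              b = c(1, 1, 2, 2, 0, 1))

out <- foo %>%
  group_by(group) %>%
  # keep a group if at least one of a, b has more than one distinct value
  filter(if_any(c(a, b), ~ n_distinct(.x) != 1)) %>%
  ungroup()
out
```

Inside a custom function, the column selection could be forwarded with the embrace operator, for example if_any({{ vars }}, ...), in the same way as the across() answer above.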
I'm struggling to get quosures to work inside a map call.
Some toy data:
library(tidyverse)
df <- tibble(
g1 = letters[1:2] %>%
rep(each = 3),
g2 = letters[3:5] %>%
rep(times = 2),
y = runif(6)
)
I can get this function to work, where I enquo a variable before I pass it to group_by:
sum1 <- function(df, g){
g <- enquo(g)
df %>%
group_by(!! g) %>%
summarize(
mu = y %>%
mean
)
}
Calling this function
sum1(df, g2)
gets me the expected result. But if I want to map over multiple grouping variables, (ie g1 & g2)
str_c("g", 1:2) %>%
map(
function(i)
sum1(df, i)
)
Returns the error
Error in grouped_df_impl(data, unname(vars), drop) :
Column `i` is unknown
How can I set up quosures in a map call?
We can use group_by_at and it can take a string as argument
library(tidyverse)
sum1 <- function(df, grps){
map(grps, ~
df %>%
group_by_at(.x) %>%
summarise(mu = mean(y))
)
}
sum1(df, str_c("g", 1:2))
#[[1]]
# A tibble: 2 x 2
# g1 mu
# <chr> <dbl>
#1 a 0.440
#2 b 0.469
#[[2]]
# A tibble: 3 x 2
# g2 mu
# <chr> <dbl>
#1 c 0.528
#2 d 0.592
#3 e 0.243
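Since group_by_at() is superseded as of dplyr 1.0.0, the same idea can be sketched with across(all_of()). The function name sum_by() is illustrative, and the data below uses fixed y values instead of runif() so that the results are reproducible.

```r
library(dplyr)

df <- tibble(g1 = rep(c("a", "b"), each = 3),
             g2 = rep(c("c", "d", "e"), times = 2),
             y  = 1:6)

# One summary per grouping column, with the column names given as strings
sum_by <- function(df, grps) {
  lapply(grps, function(g) {
    df %>%
      group_by(across(all_of(g))) %>%
      summarise(mu = mean(y), .groups = "drop")
  })
}

res <- sum_by(df, c("g1", "g2"))
res
```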
Regarding the use of quosure parameters in the function, it is not clear whether it should take a single parameter or multiple parameters.
If we pass the string as an argument, convert it to a symbol (sym) and then unquote it (!!):
sum2 <- function(df, grps){
map(grps, ~
df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(mu = mean(y))
)
}
sum2(df, str_c("g", 1:2))
#[[1]]
# A tibble: 2 x 2
# g1 mu
# <chr> <dbl>
#1 a 0.440
#2 b 0.469
#[[2]]
# A tibble: 3 x 2
# g2 mu
# <chr> <dbl>
#1 c 0.528
#2 d 0.592
#3 e 0.243
Another option, using quosures to pass multiple groups, would be:
sum3 <- function(df, ...){
gs <- enquos(...)
map(gs, ~
df %>%
group_by(!! .x) %>%
summarise(mu = mean(y)))
}
sum3(df, g1, g2)
#[[1]]
# A tibble: 2 x 2
# g1 mu
# <chr> <dbl>
#1 a 0.440
#2 b 0.469
#[[2]]
# A tibble: 3 x 2
# g2 mu
# <chr> <dbl>
#1 c 0.528
#2 d 0.592
#3 e 0.243
str_c("g", 1:2) %>%
syms() %>%
map(sum1, df = df)
syms() turns characters into symbols (expected by sum1).
Rewriting map(function(i) sum1(df, i)) as map(sum1, df = df) prevents unwanted evaluation of the promise i that happens when sum1 is wrapped in another function.
Rewriting map(function(i) sum1(df, i)) as map(sum1, df = df) makes it possible to pass the symbols g1 and g2 directly to sum1(), rather than the symbol i.
(Alternatively, str_c("g", 1:2) %>% syms() %>% map(function(i) sum1(df, !! i)) or str_c("g", 1:2) %>% map(function(i) sum1(df, !! sym(i))) work, where !! unquotes i before passing it to sum1().)
(Actually, this is a bit oversimplified: unquoting doesn't happen beforehand, but when you call enquo(g) in the body of sum1.)
Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble with the columns being the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this github-issue
So I want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Usecase example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
This change means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions :
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df()
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
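With tibble 3.0.0 or later there is also an exported helper for this case, as_tibble_row(). A minimal sketch, assuming that version is installed:

```r
library(tibble)

vec <- c(a = 1, b = 2)

# Each named element becomes one column of a one-row tibble
row <- as_tibble_row(vec)
row
```

Unlike tibble:::as_tibble.list(), this is part of the package's public interface.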