How can I parse and evaluate a column of string expressions in R as part of a pipeline?
In the example below I produce my desired column, evaluated, with a for loop, but I know this isn't the right approach. I tried taking a tidyverse approach, but I'm just very confused.
library(tidyverse)
df <- tibble(name = LETTERS[1:3],
to_evaluate = c("1-1+1", "iter+iter", "4*iter-1"),
evaluated = NA)
iter = 1
for (i in 1:nrow(df)) {
df[i,"evaluated"] <- eval(parse(text=df$to_evaluate[[i]]))
}
print(df)
# # A tibble: 3 x 3
# name to_evaluate evaluated
# <chr> <chr> <dbl>
# 1 A 1-1+1 1
# 2 B iter+iter 2
# 3 C 4*iter-1 3
As part of a pipeline, I tried:
df %>% mutate(evaluated = eval(parse(text=to_evaluate)))
df %>% mutate(evaluated = !!parse_exprs(to_evaluate))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_expr(to_evaluate)))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_exprs(to_evaluate)))
df %>% mutate(evaluated = eval_tidy(parse_exprs(to_evaluate)))
None of these work.
You can try:
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(parse(text = to_evaluate))) %>%
select(-iter)
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Following this logic, other possibilities could also work. Using rlang::parse_expr():
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(rlang::parse_expr(to_evaluate))) %>%
select(-iter)
On the other hand, I think it is important to quote @Martin Mächler:
The (possibly) only connection is via parse(text = ....) and all good
R programmers should know that this is rarely an efficient or safe
means to construct expressions (or calls). Rather learn more about
substitute(), quote(), and possibly the power of using
do.call(substitute, ......).
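That said, the expressions here genuinely arrive as strings, so some parsing is unavoidable. As a rough sketch (not part of the original answer), rlang's parse_expr() combined with eval_tidy() keeps the evaluation explicit and lets you supply iter through a data mask rather than relying on the global environment:
library(rlang)  # for parse_expr() / eval_tidy()

df %>%
  mutate(evaluated = map_dbl(
    to_evaluate,
    ~ eval_tidy(parse_expr(.x), data = list(iter = 1))  # iter supplied locally, not globally
  ))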
Here's a slightly different way that does everything within mutate.
df %>% mutate(
evaluated = pmap_dbl(., function(name, to_evaluate, evaluated)
eval(parse(text=to_evaluate)))
)
# A tibble: 3 x 3
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Note that values of additional variables (such as iter=1 in your case) can be passed directly to eval():
df %>%
mutate( evaluated = map_dbl(to_evaluate, ~eval(parse(text=.x), list(iter=1))) )
One advantage is that it automatically restricts the scope of the variable, keeping its value right next to where it is used.
Let's take this hypothetical code for instance:
```{r}
dataset_custom <- function(top, dataset, variable) {
{{dataset}} %>%
count({{variable}}) %>%
top_n(top, n) %>%
arrange(-n) %>%
left_join({{dataset}}, by = "{{variable}}")
}
```
I know this will return an error when I try to run (say) dataset_custom(5, dataset, variable) because of the by = "{{variable}}" in left_join. How do I get around this issue?
I know that when you left join and want to join by a particular variable, you write by = "variable" with quotation marks around the variable name, but how do I do that when I write it as a function and want the content of the quotation marks to change depending on the input to the function I'm trying to create?
Thank you!
It is useful if you provide some toy data, like the data found in the example of ?left_join. Note that left_join(df1, df1) is just df1, so instead we can use a second data argument.
df1 <- tibble(x = 1:3, y = c("a", "a", "b"))
df2 <- tibble(x = c(1, 1, 2), z = c("first", "second", "third"))
df1 %>% left_join(df2, by = "x")
f <- function(data, data2, variable) {
var <- deparse(substitute(variable))
data %>%
count({{ variable }}) %>%
arrange(-n) %>%
left_join(data2, by = var)
}
f(df1, df2, x)
x n z
<dbl> <int> <chr>
1 1 1 first
2 1 1 second
3 2 1 third
4 3 1 NA
# and
f(df2, df1, x)
x n y
<dbl> <int> <chr>
1 1 2 a
2 2 1 a
For this to work we need to use defusing operations so that the input is evaluated correctly. Figuratively speaking, using {{ }} in the by argument is like using a hammer instead of sandpaper for polishing: it is a forcing operation where none should happen.
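For completeness, here is a sketch of an equivalent function (f2 is just an illustrative name, not from the original answer) that converts the defused argument to a string with rlang::as_name() instead of deparse(substitute()), and passes that string to by:
f2 <- function(data, data2, variable) {
  # as_name() turns the defused column into the plain string left_join() needs
  var <- rlang::as_name(rlang::enquo(variable))
  data %>%
    count({{ variable }}) %>%
    arrange(-n) %>%
    left_join(data2, by = var)
}

f2(df1, df2, x)  # same output as f(df1, df2, x)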
I am trying to create an R data frame that includes some numerical variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (I am surely missing something here).
I have tried to look around for a possible explanation but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015), class =
c("class190"), area_1000ha = c(66.447)) %>%
mutate(FS_name = as.character(FS_name)) %>%
mutate(Year = as.integer(Year)) %>%
mutate(class = as.character(class)) %>%
tbl_df()
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type it correctly, I get the right result for the area_1000ha variable (10.5).
If I don't, i.e. keeping rm.na=, I get 11.5 instead (+1, in fact).
What am I missing?
The misspelled rm.na = TRUE is simply added to the sum: sum() has no rm.na argument, so the value is passed through ..., and since TRUE is coerced to 1, it adds 1 to your intended total.
If you change TRUE to 2, for example:
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
There is no rm.na argument to sum() in R, so R treats it as just another value to be summed, and that value is TRUE, i.e. 1.
Keep it as na.rm = TRUE and you will get the right result.
The same happens even if you change the name of the argument:
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
Here I have replaced rm.na with a tester argument; the result is inflated by 1 in the same way.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
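To see the mechanism in isolation, here is a minimal base-R illustration (using the 10.5 total from the question): any argument that sum() does not recognise is swept into ... and added to the result.
sum(10.5, rm.na = TRUE)  # 11.5 -- rm.na is not a real argument; TRUE becomes 1 and is summed
sum(10.5, na.rm = TRUE)  # 10.5 -- na.rm is a real argument and is not summed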
On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") #two column names I want to be able to toggle between for grouping
select.column <- column.names[1] #Select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
Well, this is just a copy-paste from the dplyr programming vignette: https://dplyr.tidyverse.org/articles/programming.html#programming-recipes
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think it illustrates your problem. I think what you really want to do is something like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.columns <- c("group1", "group2")
x %>% group_by_(select.columns[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family of functions in dplyr might also offer the more general solution you are after, although the dplyr documentation says they are deprecated (?group_by_) and might disappear at some point. An analogous expression to the above solution using tidy evaluation syntax is:
x %>% group_by(!!sym(select.columns[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.columns)) %>% summarize(avg = mean(value))
This creates a symbol out of a string that is evaluated by dplyr.
I recommend using group_by_at(). It supports both single strings and character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
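As a side note (not part of the original answer), group_by_at() is superseded in dplyr 1.0 and later; the same grouping by a character vector can be written with across() and all_of():
# Equivalent grouping in current dplyr:
mtcars %>% group_by(across(all_of(nms)))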
I often want to map over a vector of column names in a data frame and keep track of the output using the .id argument. But writing the column names related to each map iteration into that .id column seems to require doubling up their names in the input vector - in other words, naming each column name with its own name. If I don't name each column with its own name, then .id just stores the index of the iteration.
This is expected behavior, per the purrr::map docs:
.id
Either a string or NULL. If a string, the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input.
But my approach feels a little clunky, so I imagine I'm missing something. Is there a better way to get a list of the columns I'm iterating over, that doesn't require writing each column name twice in the input vector? Any suggestions would be much appreciated!
Here's an example to work with:
library(rlang)
library(tidyverse)
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")
cols_once %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores only the iteration index
<chr> <dbl>
1 1 -0.0519
2 2 0.204
cols_twice <- c("foo" = "foo", "bar" = "bar")
cols_twice %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores the column names
<chr> <dbl>
1 foo -0.0519
2 bar 0.204
Here's an alternative solution for your specific scenario using summarize_at and gather:
tb %>% summarize_at( cols_once, mean ) %>% gather( var, avg )
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
In a more general scenario, I don't think there's a way around naming your cols_once when working with map_dfr, because of the expected behavior you pointed out in your question. However, you can use set_names(), the "snake case" wrapper for setNames(), to do it more elegantly:
cols_once %>% set_names %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
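As an aside (not part of the original answer), summarize_at() and gather() are superseded in current dplyr/tidyr; the same reshaping can be sketched with across() and pivot_longer():
tb %>%
  summarise(across(all_of(cols_once), mean)) %>%                   # one mean per input column
  pivot_longer(everything(), names_to = "var", values_to = "avg")  # back to a var/avg table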
You could create your input vector easily with:
setNames(names(tb), names(tb))
So your code would be:
setNames(names(tb), names(tb)) %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
Edit following your comment:
Still not the solution you are hoping for, but when you don't use all the column names, you could still use setNames() and subset the ones you want (or subset out the ones you don't).
tb <- tibble(foo = rnorm(10), bar = rnorm(10), taz = rnorm(10))
setNames(names(tb), names(tb))[-3]
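For what it's worth, purrr's set_names() defaults to naming a vector with its own values, so the same subsetting can be written a bit more compactly:
set_names(names(tb))[-3]  # equivalent to setNames(names(tb), names(tb))[-3]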
I'm working on building a function that will manipulate a data frame based on a string. Within the function, I'll build a column name from the string and use it to manipulate the data frame, something like this:
library(dplyr)
orig_df <- data_frame(
id = 1:3
, amt = c(100, 200, 300)
, anyA = c(T,F,T)
, othercol = c(F,F,T)
)
summarize_my_df_broken <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
filter(!!my_column) %>%
group_by(othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
# I need the original string as new column which is why I can't
# pass in just the column name
mutate(stringid = my_string)
}
summarize_my_df_works <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
group_by(!!my_column, othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = my_string)
}
# throws an error:
# Argument 2 filter condition does not evaluate to a logical vector
summarize_my_df_broken(orig_df, "A")
# works just fine
summarize_my_df_works(orig_df, "A")
I understand what the problem is: unquoting the quosure as an argument to filter() in the broken version is not referencing the actual column anyA.
What I don't understand is why it works in summarize(), but not in filter()--why is there a difference?
Right now you are making quosures of strings, not symbols. That's not how quosures are supposed to be used. There's a big difference between quo("hello") and quo(hello). If you want to make a proper symbol from a string, you need to use rlang::sym. So a quick fix would be:
summarize_my_df_broken <- function(df, my_string) {
my_column <- rlang::sym(paste0("any", my_string))
...
}
If you look more closely I think you'll see the group_by/summarize isn't actually working the way you expect either (though you just don't get the same error message). These two do not produce the same results
summarize_my_df_works(orig_df, "A")
# `paste0("any", my_string)` othercol n total
# <chr> <lgl> <int> <dbl>
# 1 anyA FALSE 2 300
# 2 anyA TRUE 1 300
orig_df %>%
group_by(anyA, othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = "A")
# anyA othercol n total stringid
# <lgl> <lgl> <int> <dbl> <chr>
# 1 FALSE FALSE 1 200 A
# 2 TRUE FALSE 1 100 A
# 3 TRUE TRUE 1 300 A
Again the problem is using a string instead of a symbol.
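Putting that together, here is a sketch of the corrected function (summarize_my_df_fixed is just an illustrative name), with the quosure replaced by a symbol so that both filter() and group_by() see the actual column:
summarize_my_df_fixed <- function(df, my_string) {
  my_column <- rlang::sym(paste0("any", my_string))  # symbol, e.g. anyA
  df %>%
    filter(!!my_column) %>%          # keeps rows where anyA is TRUE
    group_by(othercol) %>%
    summarize(
      n = n()
      , total = sum(amt)
    ) %>%
    mutate(stringid = my_string)
}

summarize_my_df_fixed(orig_df, "A")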
You don't have any conditions for filter() in your 'broken' function, you just specify the column name.
Beyond that, I'm not sure if you can insert quosures into larger expressions. For example, here you might try something like:
df %>% filter((!!my_column) == TRUE)
But I don't think that would work.
Instead, I would suggest using the scoped verb filter_at() to target the appropriate column. In that case, you separate the quosure from the filter condition:
summarize_my_df_broken <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
filter_at(vars(!!my_column), all_vars(. == TRUE)) %>%
group_by(othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = my_string)
}
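As an aside (not part of the original answer), filter_at() is superseded in current dplyr; inside the function, the same filter can be written with the .data pronoun on the constructed column name:
df %>% filter(.data[[paste0("any", my_string)]])  # e.g. keeps rows where anyA is TRUE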