Why does add_column assign a letter to the data? - r

I tried reading through R's documentation on the add_column function, but I'm a little confused as to the examples it provides. See below:
# add_column ---------------------------------
df <- tibble(x = 1:3, y = 3:1)
df %>% add_column(z = -1:1, w = 0)
df %>% add_column(z = -1:1, .before = "y")
# You can't overwrite existing columns
try(df %>% add_column(x = 4:6))
# You can't create new observations
try(df %>% add_column(z = 1:5))
What is the purpose of these letters that are being assigned a range? Eg:
z = 1:5
My understanding from the documentation is that add_column() takes in a dataframe and appends it in position based on the .before and .after arguments defaulting to the end of the dataframe.
I'm a little confused here. There is also a "..." argument that takes in Name-value pairs. Is that what I'm seeing with "z = 1:5"? What is the functional purpose of this?

data.frame columns always have a name in R, no exception.
Since add_column adds new columns, you need to specify names for these columns.
… well, technically you don’t need to. The following works:
df %>% add_column(1 : 3)
But add_column auto-generates the column name based on the expression you pass it, and you might not like the result (in this case, it’s literally 1:3, which isn’t a convenient name to work with).
Conversely, the following also works and is perfectly sensible:
z = 1 : 3
df %>% add_column(z)
Result:
# A tibble: 3 x 3
x y z
<int> <int> <int>
1 1 3 1
2 2 2 2
3 3 1 3
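To make the name-value idea concrete, here is a minimal sketch using the same toy tibble as the documentation example. The names z and w are arbitrary; any valid column name would do. add_column() is called directly rather than through the pipe, so only tibble needs to be attached.

```r
library(tibble)

df <- tibble(x = 1:3, y = 3:1)

# Each argument passed through `...` is a name-value pair:
# the name becomes the column name, the value becomes the column contents.
# A length-1 value such as `w = 0` is recycled to the number of rows.
res <- add_column(df, z = -1:1, w = 0, .before = "y")

names(res)
# "x" "z" "w" "y"
```

So the "letter" is simply the name the new column gets, and the range is the data that goes into it.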

Related

How to write a function in R where one of the inputs is meant to go in quotation marks? (" ")

Let's take this hypothetical code for instance:
```{r}
dataset_custom <- function(top, dataset, variable) {
{{dataset}} %>%
count({{variable}}) %>%
top_n(top, n) %>%
arrange(-n) %>%
left_join({{dataset}}, by = "{{variable}}")
}
```
I know this will return an error when I try to run (say) dataset_custom(5, dataset, variable) because of the by = "{{variable}}" in left_join. How do I get around this issue?
I know that when you left join by a particular variable, you write by = "variable", with the variable name in quotes. But how do I do that inside a function, where the quoted name has to change depending on the input to the function?
Thank you!
It is useful if you provide some toy data, like the one found in the example of ?left_join. Note that left_join(df1, df1) is just df1. Instead, we can use a 2nd data argument.
df1 <- tibble(x = 1:3, y = c("a", "a", "b"))
df2 <- tibble(x = c(1, 1, 2), z = c("first", "second", "third"))
df1 %>% left_join(df2, by = "x")
f <- function(data, data2, variable) {
var <- deparse(substitute(variable))
data %>%
count({{ variable }}) %>%
arrange(-n) %>%
left_join(data2, by = var)
}
f(df1, df2, x)
x n z
<dbl> <int> <chr>
1 1 1 first
2 1 1 second
3 2 1 third
4 3 1 NA
# and
f(df2, df1, x)
x n y
<dbl> <int> <chr>
1 1 2 a
2 2 1 a
For this to work, the argument has to be defused (captured unevaluated) and turned into a string, which is what deparse(substitute(variable)) does, because by expects a character vector of column names. Figuratively speaking, using {{ }} in the by argument is like using a hammer instead of sandpaper for polishing things: it is a forcing operation where none should happen.
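If you prefer to stay within tidy evaluation, a possible alternative is to get the string with rlang::as_name(rlang::enquo(variable)) instead of deparse(substitute(variable)). This is only a sketch under the assumption that a bare column name is passed; f2 is a hypothetical name, and the toy data is the same as above.

```r
library(dplyr)

df1 <- tibble(x = 1:3, y = c("a", "a", "b"))
df2 <- tibble(x = c(1, 1, 2), z = c("first", "second", "third"))

# f2 is a hypothetical helper, not part of any package
f2 <- function(data, data2, variable) {
  # Defuse the argument and convert the bare column name to a string
  var <- rlang::as_name(rlang::enquo(variable))
  data %>%
    count({{ variable }}) %>%      # {{ }} is fine where a column is expected
    arrange(desc(n)) %>%
    left_join(data2, by = var)     # `by` needs a plain string
}

res <- f2(df1, df2, x)
```

The split is the point: {{ }} goes where dplyr expects a column, and the string goes where it expects a name.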

Map readr::type_convert to specific columns only

readr::type_convert guesses the class of each column in a data frame. I would like to apply type_convert to only some columns in a data frame (to preserve other columns as character). MWE:
# A data frame with multiple character columns containing numbers.
df <- data.frame(A = letters[1:10],
B = as.character(1:10),
C = as.character(1:10))
# This works
df %>% type_convert()
Parsed with column specification:
cols(
A = col_character(),
B = col_double(),
C = col_double()
)
A B C
1 a 1 1
2 b 2 2
...
However, I would like to only apply the function to column B (this is a stylised example; there may be multiple columns to try and convert). I tried using purrr::map_at as well as sapply, as follows:
# This does not work
map_at(df, "B", type_convert)
Error in .f(.x[[i]], ...) : is.data.frame(df) is not TRUE
# This does not work
sapply(df["B"], type_convert)
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
Is there a way to apply type_convert selectively to only some columns of a data frame?
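Edit: @ekoam provides an answer for type_convert. However, applying this answer to many columns would be tedious. It might be better to use the base::type.convert function, which can be mapped (as.is = TRUE keeps strings as character and avoids the warning that recent R versions give when it is left unspecified):
purrr::map_at(df, "B", type.convert, as.is = TRUE) %>%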
bind_cols()
# A tibble: 10 x 3
A B C
<chr> <int> <chr>
1 a 1 1
2 b 2 2
Try this:
df %>% type_convert(cols(B = "?", .default = "c"))
Guess the type of B; any other character column stays as is. The tricky part is that if any column is not of a character type, then type_convert will also leave it as is. So if you really have to type_convert, maybe you have to first convert all columns to characters.
type_convert does not seem to support this directly. One trick I have used a few times is a combination of select and bind_cols, as shown below. Note that the converted column ends up first in the result.
df %>%
select(B) %>%
type_convert() %>%
bind_cols(df %>% select(-B))
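Another option that keeps the column order is mutate() with across() and base::type.convert. This is a minimal sketch, assuming dplyr 1.0.0 or later; cols_to_convert is a hypothetical name for whichever columns you want to convert.

```r
library(dplyr)

df <- data.frame(A = letters[1:10],
                 B = as.character(1:10),
                 C = as.character(1:10))

# Hypothetical vector of columns to convert; extend as needed
cols_to_convert <- c("B")

res <- df %>%
  mutate(across(all_of(cols_to_convert),
                ~ type.convert(.x, as.is = TRUE)))

# B is now integer, C is still character, and the columns stay in place
str(res)
```

Unlike type_convert, this uses base R's guessing rules, so the resulting types may differ slightly (for example integer instead of double).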

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter .add (called add before dplyr 1.0.0), and if it's true, it adds to the existing grouping. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, .add=TRUE)
data <- data %>% group_by(b, .add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function accepts a named list of functions as its .funs argument, so we can write:
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
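A different way to add the summary variables independently is to collect unevaluated expressions in a named list and splice them into a single summarise() call with !!!. This is a minimal sketch on a local data frame; the list name summaries is hypothetical. Since only expressions are passed on, I would expect this to translate with dbplyr as well, but I have not verified it against a database backend.

```r
library(dplyr)

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))

# Build up the summary expressions one condition at a time
summaries <- list()
if (summarize_num) {
  summaries$n <- rlang::expr(n())
}
if (summarize_num_distinct) {
  summaries$n_unique <- rlang::expr(n_distinct(val))
}

# Splice the named expressions in as if they had been typed as arguments
summ <- data %>% summarise(!!!summaries)
summ
#   n
# 1 3
```

Each condition contributes exactly one line, so the number of clauses grows linearly with the number of conditions rather than combinatorially.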

Parse and Evaluate Column of String Expressions in R?

How can I parse and evaluate a column of string expressions in R as part of a pipeline?
In the example below, I produce my desired column, evaluated. But I know this isn't the right approach. I tried taking a tidyverse approach. But I'm just very confused.
library(tidyverse)
df <- tibble(name = LETTERS[1:3],
to_evaluate = c("1-1+1", "iter+iter", "4*iter-1"),
evaluated = NA)
iter = 1
for (i in 1:nrow(df)) {
df[i,"evaluated"] <- eval(parse(text=df$to_evaluate[[i]]))
}
print(df)
# # A tibble: 3 x 3
# name to_evaluate evaluated
# <chr> <chr> <dbl>
# 1 A 1-1+1 1
# 2 B iter+iter 2
# 3 C 4*iter-1 3
As part of a pipeline, I tried:
df %>% mutate(evaluated = eval(parse(text=to_evaluate)))
df %>% mutate(evaluated = !!parse_exprs(to_evaluate))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_expr(to_evaluate)))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_exprs(to_evaluate)))
df %>% mutate(evaluated = eval_tidy(parse_exprs(to_evaluate)))
None of these work.
You can try:
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(parse(text = to_evaluate))) %>%
select(-iter)
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Following the same logic, other approaches work too, for example using rlang::parse_expr():
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(rlang::parse_expr(to_evaluate))) %>%
select(-iter)
On the other hand, I think it is important to quote @Martin Mächler:
The (possibly) only connection is via parse(text = ....) and all good
R programmers should know that this is rarely an efficient or safe
means to construct expressions (or calls). Rather learn more about
substitute(), quote(), and possibly the power of using
do.call(substitute, ......).
Here's a slightly different way that does everything within mutate.
df %>% mutate(
evaluated = pmap_dbl(., function(name, to_evaluate, evaluated)
eval(parse(text=to_evaluate)))
)
# A tibble: 3 x 3
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Note that values of additional variables (such as iter=1 in your case) can be passed directly to eval():
df %>%
mutate( evaluated = map_dbl(to_evaluate, ~eval(parse(text=.x), list(iter=1))) )
One advantage is that it automatically restricts the scope of the variable, keeping its value right next to where it is used.
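The scoping idea can be taken one step further in base R. The sketch below is an assumption-laden illustration, not a safe sandbox: eval_string is a hypothetical helper that evaluates each string with only the supplied variables plus base R functions visible (enclos = baseenv()), so variables in the global environment are not picked up by accident.

```r
to_evaluate <- c("1-1+1", "iter+iter", "4*iter-1")

# Hypothetical helper: evaluate one string with an explicit set of variables.
# `vars` supplies the variables; `enclos = baseenv()` makes operators such as
# `+` and `*` available without exposing the global environment.
eval_string <- function(s, vars) {
  eval(parse(text = s), envir = vars, enclos = baseenv())
}

evaluated <- vapply(to_evaluate, eval_string, numeric(1),
                    vars = list(iter = 1), USE.NAMES = FALSE)
evaluated
# 1 2 3
```

This still executes arbitrary code from the strings, so the warning about parse(text = ...) quoted above applies unchanged.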

Make column of input items with purrr::map_df using .id without duplicating inputs for named vector

I often want to map over a vector of column names in a data frame, and keep track of the output using the .id argument. But to write the column names related to each map iteration into that .id column seems to require doubling up their name in the input vector - in other words, by naming each column name with its own name. If I don't name the column with its own name, then .id just stores the index of the iteration.
This is expected behavior, per the purrr::map docs:
.id
Either a string or NULL. If a string, the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input.
But my approach feels a little clunky, so I imagine I'm missing something. Is there a better way to get a list of the columns I'm iterating over, that doesn't require writing each column name twice in the input vector? Any suggestions would be much appreciated!
Here's an example to work with:
library(rlang)
library(tidyverse)
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")
cols_once %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores only the iteration index
<chr> <dbl>
1 1 -0.0519
2 2 0.204
cols_twice <- c("foo" = "foo", "bar" = "bar")
cols_twice %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores the column names
<chr> <dbl>
1 foo -0.0519
2 bar 0.204
Here's an alternative solution for your specific scenario using summarize_at and gather:
tb %>% summarize_at( cols_once, mean ) %>% gather( var, avg )
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
In a more general scenario, I don't think there's a way around naming your cols_once when working with map_dfr, because of the expected behavior you pointed out in your question. However, you can use purrr's set_names(), the snake-case counterpart of setNames(), to do it more elegantly: called without a second argument, it names a vector after its own values.
cols_once %>% set_names %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
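For what it's worth, the same pattern can be written with map() followed by list_rbind(names_to = ...). This is a sketch that assumes purrr 1.0.0 or later (where list_rbind() was introduced); set.seed() is only there to make the toy data reproducible.

```r
library(dplyr)
library(purrr)

set.seed(1)
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")

res <- cols_once %>%
  set_names() %>%                                   # c(foo = "foo", bar = "bar")
  map(~ summarise(tb, avg = mean(.data[[.x]]))) %>% # one 1-row tibble per column
  list_rbind(names_to = "var")                      # list names go into `var`

res$var
# "foo" "bar"
```

The column names are still written only once; set_names() does the doubling for you.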
You could create your input vector easily with:
setNames(names(tb), names(tb))
So your code would be:
setNames(names(tb), names(tb)) %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
Edit following your comment:
Still not the solution you are hoping for, but when you don't use all the column names, you could still use setNames() and subset the ones you want (or subset out the ones you don't).
tb <- tibble(foo = rnorm(10), bar = rnorm(10), taz = rnorm(10))
setNames(names(tb), names(tb))[-3]
