tidyeval difference between mutate `:=` and mutate `=` - r

Both of these code blocks work even though they use different assignment operators, one with := and the other with =. Which is correct, and why? I thought tidyeval required := when using dplyr functions, but strangely enough = works just fine in my mutate call.
1
library(tidyverse)
set.seed(1)
graph.data <- tibble(cal.date = as.Date(40100:40129, origin = "1899-12-30"),
                     random_num = rnorm(30, 8, 5))
child_function <- function(df, variable, hor.line = 6) {
  variable <- enquo(variable)
  df <- mutate(df, mutation := 2 * !! variable, horizontal.line := hor.line)
  df
}
child_function(graph.data, variable = random_num, hor.line = 8)
2
library(tidyverse)
set.seed(1)
graph.data <- tibble(cal.date = as.Date(40100:40129, origin = "1899-12-30"),
                     random_num = rnorm(30, 8, 5))
child_function <- function(df, variable, hor.line = 6) {
  variable <- enquo(variable)
  df <- mutate(df, mutation = 2 * !! variable, horizontal.line = hor.line)
  df
}
child_function(graph.data, variable = random_num, hor.line = 8)

The purpose of the := operator is to let you dynamically set the name of the variable on the LHS (left-hand side) of the assignment, which you are not doing here.
In many cases, including this one, we're just concerned with manipulating the RHS. The := would come in handy if you wanted to control the name of the "mutation" variable.
https://dplyr.tidyverse.org/articles/programming.html#setting-variable-names

There is no obligation to put := in that case.
It becomes obligatory when you want to do something like:
child_function <- function(df, variable, hor.line = 6, mt_name = "mutation") {
  variable <- enquo(variable)
  # the new column is named after the string stored in mt_name
  df <- mutate(df, !! mt_name := 2 * !! variable, horizontal.line = hor.line)
  df
}

A little bit hard to track down, but from ?quasiquotation
Unfortunately R is very strict about the kind of expressions supported
on the LHS of =. This is why we have made the more flexible :=
operator an alias of =. You can use it to supply names, e.g. a := b is
equivalent to a = b. Since its syntax is more flexible you can unquote
on the LHS:
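To make the quoted passage concrete, here is a minimal sketch (assuming dplyr is installed; the tibble `toy` and the variable `new_name` are made up for illustration). It shows that := and = give the same result for a fixed name, and that only := accepts an unquoted name on the LHS.

```r
library(dplyr)

toy <- tibble(x = 1:3)   # illustrative data, not from the question
new_name <- "doubled"    # hypothetical column name held in a string

# With a literal name on the LHS, `=` and `:=` behave the same way.
with_equals <- mutate(toy, doubled = 2 * x)
with_colon  <- mutate(toy, doubled := 2 * x)

# Unquoting on the LHS is only possible with `:=`.
with_unquote <- mutate(toy, !!new_name := 2 * x)
```

Writing `mutate(toy, !!new_name = 2 * x)` instead is rejected by the R parser before dplyr ever sees it, which is the strictness the help page refers to.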

Related

Writing kind of for-loop using dplyr in R

Is it possible to use some kind of for loop inside dplyr syntax? I'm using the following code to check the presence of MAP<99, MAP<98 and so on until MAP<1. This is not very efficient, so I would like to repeat this computation for MAP< [100:1].
duur2_vs_diepte <- data_blood_pressure %>%
  summarise(
    duur_tm99_2 = (sum(MAP<=99))^2,
    duur_tm98_2 = (sum(MAP<=98))^2,
    duur_tm97_2 = (sum(MAP<=97))^2,
    .......
    duur_tm4_2 = (sum(MAP<=4))^2,
    duur_tm3_2 = (sum(MAP<=3))^2,
    duur_tm2_2 = (sum(MAP<=2))^2,
    duur_tm1_2 = (sum(MAP<=1))^2
  )
This may work for you:
library(dplyr)
# a helper function to create each column
create_columns <- function(x, mat) {
  # count the rows with MAP <= x and square the count, as in the question
  dt <- mat %>%
    summarise(sum(MAP <= x, na.rm = TRUE)^2)
  names(dt) <- paste0("duur_tm", x, "_2")
  dt
}
# get all results together
bind_cols(lapply(100:1, create_columns, data_blood_pressure))

Why am I receiving "invalid 'right' argument" when using cut()

I created a function in R that creates deciles (or any n-tile) based on a volume metric as opposed to observation counts.
User_Decile <- function(x, n, Output = " ") {
  require(dplyr)
  df <- data_frame(index = seq_along(x), value = x)
  x_sum <- sum(df$value)
  x_ranges <- x_sum / n
  df <- df %>% arrange(value)
  df$cumsum <- cumsum(df$value)
  df$bins <- cut(df$cumsum, breaks = floor(seq(0, x_sum, x_ranges)),
                 right = T,
                 include.lowest = T,
                 labels = as.integer(seq(1, n, 1)))
  if (Output == "Summary") {
    df <- df %>% group_by(bins)
    return(df %>% summarise(Lower_Bound = min(value),
                            Upper_Bound = max(value) - 1,
                            Value_sum = sum(value)))
  } else {
    df <- df %>% arrange(index)
    return(as.numeric(df$bins))
  }
}
(x is a vector of numbers, n is the number of bins/-tiles to group the data into, Output= specifies if you want a summary of the bounds/data or the actual data itself.)
It previously worked well within a program I created to segment some data, but I just tried to use the function again for the first time in a couple of months and I'm getting:
Error in .bincode(x, breaks, right, include.lowest) :
invalid 'right' argument
According to the error, the issue is with the 'right' argument in the cut() function. As far as I know, the right= argument is boolean and only takes T or F values. I've tried both, but neither seems to work.
Does anyone have a workaround for this issue, or can recommend another function in place of cut()?
?TRUE states that:
TRUE and FALSE are reserved words denoting logical constants in the R
language, whereas T and F are global variables whose initial values
set to these.
It appears that T is being interpreted as something else here. You should always use TRUE to be on the safe side.
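A minimal sketch of how this can happen, using base R only. The reassignment of T below is a made-up stand-in for whatever changed it in the asker's session, and `msg` and `ok` are illustrative names.

```r
# T is an ordinary variable, so any script or package can mask it by accident.
T <- "oops"   # hypothetical accidental reassignment

# With T no longer logical, cut() is expected to fail on its `right` argument;
# the error message is captured instead of stopping the script.
msg <- tryCatch(
  cut(c(1, 5, 9), breaks = c(0, 5, 10), right = T, labels = FALSE),
  error = function(e) conditionMessage(e)
)

rm(T)         # drop the masking variable again

# TRUE is a reserved word and cannot be reassigned, so this call is safe.
ok <- cut(c(1, 5, 9), breaks = c(0, 5, 10), right = TRUE, labels = FALSE)
```

Running `exists("T", envir = globalenv(), inherits = FALSE)` in the affected session is one way to check whether something has defined its own T.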

dplyr::mutate unquote RHS

I am wondering how to properly unquote (UQ) variable names created from strings on the RHS in dplyr verbs like mutate. See the error messages I got, shown as comments in the wilcox.test part of this MWE:
require(dplyr)
dfMain <- data.frame(
  base = c(rep('A', 5), rep('B', 5)),
  id = letters[1:10],
  q0 = rnorm(10)
)
backgs <- list(
  A = rnorm(13),
  B = rnorm(11)
)
fun <- function(dfMain, i = 0){
  pcol <- sprintf('p%i', i)
  qcol <- sprintf('q%i', i)
  (
    dfMain %>%
    group_by(id) %>%
    mutate(
      !!pcol := ifelse(
        !is.nan(!!qcol) &
        length(backgs[[base]]),
        wilcox.test(
          # !!(qcol) - backgs[[base]]
          # object 'base' not found
          # (!!qcol) - backgs[[base]]
          # non-numeric argument to binary operator
          (!!qcol) - backgs[[base]]
        )$p.value,
        NaN
      )
    )
  )
}
dfMain <- dfMain %>% fun()
I guess that with !!(qcol) - backgs[[base]] the whole expression is unquoted, not only the variable name, and that is why base is not found? I also found out that (!!qcol) returns the string itself, so it is no surprise that the - operator is unable to handle it.
Your code should work as you expect by changing the line where you define qcol to:
qcol <- as.symbol(sprintf('q%i', i))
That is, since qcol was a string, you needed to turn it into a symbol before unquoting for it to be evaluated correctly in your mutate. Also I presume the column you wanted to refer to was the q0 column you defined in your data, not a non-existent column named qval0.
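The difference between unquoting a string and unquoting a symbol can be seen in a small sketch (assuming dplyr and rlang are installed; `toy` and `col_name` are made up for illustration, and rlang::sym() is used as an equivalent of as.symbol()).

```r
library(dplyr)
library(rlang)

toy <- tibble(q0 = c(1, 2, 3))   # illustrative data
col_name <- "q0"                 # column name stored as a string

# Unquoting the string inserts the string itself, not the column.
as_string <- mutate(toy, value = !!col_name)

# Converting to a symbol first makes the unquoted value refer to the column.
as_symbol <- mutate(toy, value = 2 * !!sym(col_name))
```

Here `as_string$value` holds the text "q0" repeated for each row, which is why arithmetic on it fails, while `as_symbol$value` holds the doubled numbers.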

Am I using NSE and rlang correctly/reasonably?

I've been reading through programming with dplyr and trying to apply the ideas it describes in my work. I have something that works, but it's unclear to me whether I've done it in the "right" way. Is there something more elegant or concise I could be doing?
I have a tibble where rows are scenarios and columns relate to tests that were run in that scenario. There are two types of columns, those that store a test statistic that was computed in that scenario and those that store the degrees of freedom of that test.
So, here's a small, toy example of the type of data I have:
library(tidyverse)
set.seed(27599)
my_tbl <- data_frame(test1_stat = rnorm(12), test1_df = rep(x = c(1, 2, 3), times = 4),
                     test2_stat = rnorm(12), test2_df = rep(x = c(1, 2, 3, 4), times = 3))
I want to compute a summary of each test that will be based on both its stat and its df. My example here is that I want to compute the median stat for each group, where groups are defined by df. The groupings are not guaranteed to be the same across tests, nor are the number of groups even guaranteed to be the same.
So, here's what I've done:
get_test_median = function(df, test_name) {
  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')
  df %>%
    dplyr::group_by(rlang::UQ(rlang::sym(df_col_name))) %>%
    dplyr::summarise(rlang::UQ(median_col_name) := median(x = rlang::UQ(rlang::sym(stat_col_name)), na.rm = TRUE))
}
my_tbl %>% get_test_median(test_name = 'test1')
my_tbl %>% get_test_median(test_name = 'test2')
This works. But is it how an experienced rlang user would do it? I am new to NSE, and a bit surprised to be using two nested rlang functions repeatedly (UQ(sym(.))).
I am happy using UQ rather than !!, just because I'm more comfortable with traditional function notation.
Based on the comments, I got rid of the namespace::function notation and now my function doesn't look so verbose:
get_test_median = function(df, test_name) {
  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')
  df %>%
    dplyr::group_by(UQ(sym(df_col_name))) %>%
    dplyr::summarise(UQ(median_col_name) := median(x = UQ(sym(stat_col_name)), na.rm = TRUE))
}
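For comparison, here is one possible way to write the same function with !! instead of UQ(), offered as a sketch rather than the definitive idiom. It assumes dplyr and rlang are installed; the name `get_test_median_sketch` and the tibble `toy` are made up for illustration. Building the symbols once up front avoids repeating the nested call, and the functional UQ() form has since been deprecated in rlang in favour of !!.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of get_test_median(); the name is illustrative only.
get_test_median_sketch <- function(df, test_name) {
  stat_col <- sym(paste0(test_name, "_stat"))      # symbol for the statistic column
  df_col <- sym(paste0(test_name, "_df"))          # symbol for the grouping column
  median_col <- paste0(test_name, "_median")       # output name stays a string
  df %>%
    group_by(!!df_col) %>%
    summarise(!!median_col := median(!!stat_col, na.rm = TRUE))
}

# Small made-up data set with the same column layout as in the question.
toy <- tibble(test1_stat = c(1, 2, 3, 4, 5, 6),
              test1_df   = c(1, 2, 3, 1, 2, 3))
res <- get_test_median_sketch(toy, "test1")
```

The result has one row per value of test1_df and a column named test1_median.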

Writing a function that changes the value in one column based on search in second column in data.table in R

So I've written a function to change the value in one column based on the value in another one as I need to do this quite often. However, I cannot get it to work. Help is much appreciated!
dt <- data.table(mtcars)
ittt <- function(dt, col.a, col.b, if.a, then.b){
  a<-dt[col.a == if.a, col.b := then.b]
}
a<-ittt(dt = dt, col.a = 'mpg', col.b = 'disp', if.a = 21, then.b = 000)
a
dt[mpg == 21, disp := 999]
dt
One way is shown below. Remember to validate the input in your function to make sure the user is passing existing column names and the expected data types.
library(data.table)
dt <- data.table(mtcars)
ittt <- function(dt, col.a, col.b, if.a, then.b, in.place=FALSE){
  ii = substitute(lhs == rhs, list(lhs=as.name(col.a), rhs=if.a))
  jj = substitute(lhs := rhs, list(lhs=as.name(col.b), rhs=then.b))
  if (!in.place) dt = copy(dt)
  dt[eval(ii), eval(jj)][]
}
a<-ittt(dt = dt, col.a = 'mpg', col.b = 'disp', if.a = 21, then.b = 000)
a
# if you update in.place, then no assignment to new variable required
ittt(dt = dt, col.a = 'mpg', col.b = 'disp', if.a = 21, then.b = 000, in.place=TRUE)
I see it is pretty much the same as the solution proposed by Rich in the comments.
You could use [[ to get a column by using its name as a variable, and ifelse to apply an element-wise condition to the content.
ittt <- function(dt, col.a, col.b, if.a, then.b){
  dt[[col.b]] <- ifelse(dt[[col.a]] == if.a, then.b, dt[[col.b]])
  dt
}
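Another commonly used data.table idiom for column names held in strings is get() in i together with a parenthesised name on the LHS of :=. The sketch below assumes data.table is installed; the helper name `set_where` and the object `cars` are made up for illustration, and the helper works on a copy so the input table is left untouched.

```r
library(data.table)

# Hypothetical helper; name and arguments are illustrative only.
set_where <- function(dt, col.a, col.b, if.a, then.b) {
  out <- copy(dt)                               # avoid modifying the caller's table
  out[get(col.a) == if.a, (col.b) := then.b]    # (col.b) uses the string as the column name
  out[]
}

cars <- data.table(mtcars)
res <- set_where(cars, col.a = "mpg", col.b = "disp", if.a = 21, then.b = 0)
```

Only the rows where mpg equals 21 get a new disp value in `res`, and `cars` keeps its original values.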
