dplyr mutate using dynamic variable name while respecting group_by - r

I'm trying as per
dplyr mutate using variable columns
&
dplyr - mutate: use dynamic variable names
to use dynamic names in mutate. What I am trying to do is to normalize column data by groups subject to a minimum standard deviation. Each column has a different minimum standard deviation
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>% group_by(Species) %>% mutate(!!varname := mean(pluck(iris,varname),na.rm=T)/max(sd(pluck(iris,varname)),minsd[varname]))
I got the dynamic assignment & variable selection to work as suggested by the reference answers. But group_by() is not respected which, for me at least, is the main benefit of using dplyr here
desired answer is given by
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?

I actually did not know much about pluck, so I don't know what went wrong, but I would go for this and this works:
iris %>%
group_by(Species) %>%
mutate(
!! varname :=
mean(!!as.name(varname), na.rm = T) /
max(sd(!!as.name(varname)),
minsd[varname])
)
Let me know if this isn't what you were looking for.

The other answer is obviously the best and it also solved a similar problem that I have encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris,varname) in your code with .data[[varname]]. The reason why pluck(iris,varname) does not work is that, I suppose, iris in pluck(iris,varname) is not grouped. However, .data refer to the tibble that executes mutate(), and so is grouped.
An alternative to as.name() is rlang::sym() from the rlang package.

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){ aggdata <- data %>% group_by({{group}}) %>% select_if(function(col) !is.numeric(col) & !is.integer(col)) %>% summarise_if(is.numeric, sum, na.rm = TRUE) return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.

Appropriate times to use cur_data() vs across() in dplyr 1.0

Here is a situation where using cur_data() and across() yields the same result:
library(tidyverse)
by_cyl <- mtcars %>% group_by(cyl)
a <- summarise(by_cyl,head(cur_data(), 2))
b <- summarise(by_cyl,head(across(), 2)))
identical(a,b)
#TRUE
My question is what is the difference between these two functions? When are they interchangeable and when are they not?
I think there is some misunderstanding. across and cur_data() are not related/interchangeable because they both are for different purpose.
across is used to apply a function on multiple columns, cur_data returns the current data for a group without the grouping variable.
In the example shared when you are using across() it does nothing and returns data as it is.
For example,
library(dplyr)
mtcars %>% summarise(across())
returns the same mtcars dataset as it is because the default values in across are .cols = everything() and .fns = NULL. So it applies NULL function to all the columns.
I am not sure what output you were expecting in this case.

tidy eval with e.g. mtcars %>% mutate(target := log(target))

I figured this out while typing my question, but would like to see if there's a cleaner, less code way of doing what I want.
e.g. code block:
target <- "mpg"
# want
mtcars %>%
mutate(target := log(target))
I'd like to update mpg to be the log of mpg based on the variable target.
Looks like I got this working with:
mtcars %>%
mutate(!! rlang::sym(target) := log(!! rlang::sym(target)))
That just reads as pretty repetitive. Is there a 'cleaner', less code way of achieving the same result?
I'm fond of the double curly braces {{var}}, no reason, they are just nicer to read imho but I couldn't get the same results when I tried:
mtcars %>%
mutate(!! rlang::sym(target) := log({{target}}))
What are the various ways I can use tidyeval to mutate a field via transformation based on a pre determined variable to define which field to be transformed, in this case the variable 'target'?
On the lhs of :=, the string can be evaluated with just !!, while on the rhs, it is the value that we need, so we convert to symbol and evaluate (!!)
library(dplyr)
mtcars %>%
mutate(!!target := log(!! rlang::sym(target)))
1) Use mutate_at
library(dplyr)
mtcars %>% mutate_at(target, log)
2) We can use the magrittr %<>% operator:
library(magrittr)
mtcars[[target]] %<>% log
3) Of course this is trivial in base R:
mtcars[[target]] <- log(mtcars[[target]])

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as >df%>% but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
>df%>%
+pull(height)%>%
+mean(na.rm=TRUE)
+print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- seq(1:30)
weight <- seq(1:30)
df <- data.frame(height, weight)
These pipe operators work with the majority of the tidyverse (not just magrittr). What you are trying to do is actually coming out of dplyr. The na.rm=T is required for many summary variables like mean, sd, as well as certain functions used to gather specific data points like min, max, etc. These functions don't play well with NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]

Dplyr or Magrittr - tolower?

Is it possible to set all column names to upper or lower within a dplyr or magrittr chain?
In the example below I load the data and then, using a magrittr pipe, chain it through to my dplyr mutations. In the 4th line I use the tolower function , but this is for a different purpose: to create a new variable with lowercase observations.
mydata <- read.csv('myfile.csv') %>%
mutate(Year = mdy_hms(DATE),
Reference = (REFNUM),
Event = tolower(EVENT)
I'm obviously looking for something like colnames = tolower but know this doesn't work/exist.
I note the dplyr rename function but this isn't really helpful.
In magrittr the colname options are:
set_colnames instead of base R's colnames<-
set_names instead of base R's names<-
I've tried numerous permutations with these but no dice.
Obviously this is very simple in base r.
names(mydata) <- tolower(names(mydata))
However it seems incongruous with the dplyr/magrittr philosophies that you'd have to do that as a clunky one liner, before moving on to an elegant chain of dplyr/magrittr code.
with {dplyr} we can do :
mydata %>% rename_all(tolower)
or
mydata %>% rename(across(everything(), tolower))
iris %>% setNames(tolower(names(.))) %>% head
Or equivalently use replacement function in non-replacement form:
iris %>% `names<-`(tolower(names(.))) %>% head
iris %>% `colnames<-`(tolower(names(.))) %>% head # if you really want to use `colnames<-`
Using magrittr's "compound assignment pipe-operator" %<>% might be, if I understand your question correctly, an even more succinct option.
library("magrittr")
names(iris) %<>% tolower
?`%<>%` # for more
mtcars %>%
set_colnames(value = casefold(colnames(.), upper = FALSE)) %>%
head
casefold is available in base R and can convert in both direction, i.e. can convert to either all upper case or all lower case by using the flag upper, as need might be.
Also colnames() will use only column headers for case conversion.
You could also define a function:
upcase <- function(df) {
names(df) <- toupper(names(df))
df
}
library(dplyr)
mtcars %>% upcase %>% select(MPG)

Resources