R: Further subset a selection using the pipe %>% and placeholder

I recently discovered the pipe operator %>%, which can make code more readable. Here is my MWE.
library(dplyr) # for the pipe operator
library(lsr) # for the cohensD function
set.seed(4) # make it reproducible
dat <- data.frame( # create data frame
subj = c(1:6),
pre = sample(1:6, replace = TRUE),
post = sample(1:6, replace = TRUE)
)
dat %>% select(pre, post) %>% sapply(., mean) # works as expected
However, I struggle using the pipe operator in this particular case
dat %>% select(pre, post) %>% cohensD(.$pre, .$post) # piping returns an error
cohensD(dat$pre, dat$post) # classical way works fine
Why is it not possible to subset columns using the placeholder . in combination with $? Is it worthwhile to write this line using the pipe operator %>%, or does it complicate the syntax? The classical way of writing this seems more concise.

This would work:
dat %>% select(pre, post) %>% {cohensD(.$pre, .$post)}
Wrapping the last call in curly braces makes it be treated as an expression rather than a function call. When you pipe something into an expression, the . gets replaced as expected. I often use this trick to call a function which does not interface well with piping.
What is inside the braces happens to be a function call, but it could really be any expression involving the placeholder.
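To see the difference without needing lsr, here is a minimal sketch. It assumes a made-up helper, count_args (not part of any package), that only reports how many arguments it received:

```r
library(magrittr)

# Hypothetical helper (not from any package): counts the arguments it receives.
count_args <- function(...) length(list(...))

dat <- data.frame(pre = 1:3, post = 4:6)

# `.` appears only inside nested calls (`$`), so dat is still inserted first.
n_nested <- dat %>% count_args(.$pre, .$post)

# Braces turn the right-hand side into an expression; nothing is inserted.
n_braces <- dat %>% { count_args(.$pre, .$post) }

# `.` used directly as an argument also suppresses first-argument insertion.
n_direct <- dat %>% count_args("x", .)
```

Here n_nested is 3 while n_braces and n_direct are both 2, which suggests why cohensD received one argument more than intended.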

Since you're going from a bunch of data to one (row of) value(s), you're summarizing. In a dplyr pipeline you can use the summarize function for that. Within summarize you don't need to subset and can refer to pre and post directly.
Like so:
dat %>% select(pre, post) %>% summarize(CD = cohensD(pre, post))
(The select statement isn't actually necessary in this case, but I left it in to show how this works in a pipeline)
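The same pattern should work with any function that returns a single value. A small sketch, using base R's cor as a stand-in for cohensD (an assumption, so that the example runs without lsr) and made-up data:

```r
library(dplyr)

dat <- data.frame(pre = c(1, 2, 3, 4), post = c(2, 4, 6, 8))

# Inside summarize(), bare column names refer to columns of the piped data,
# so no placeholder or `$` is needed.
res <- dat %>%
  summarize(r = cor(pre, post), mean_diff = mean(post - pre))
```

The result is a one-row data frame, so it can be piped further or joined to other summaries.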

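It doesn't work because magrittr only skips inserting the left-hand side as the first argument when the placeholder . is used directly as an argument. When . appears only inside a nested call (and $ is a function call), the left-hand side is still inserted first, so your line is evaluated as cohensD(., .$pre, .$post).
If you really want to use piping, you can do it with the formula interface, but with a little reshaping first (melt is from the reshape2 package):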
dat %>% select(pre, post) %>% melt %>% cohensD(value~variable, .)
#### [1] 0.8115027
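Two other options avoid both the braces and the reshaping: magrittr's exposition pipe %$% and base R's with(). A hedged sketch, again using cor as a stand-in for cohensD (an assumption, to keep the example free of lsr) and made-up data:

```r
library(magrittr)

dat <- data.frame(pre = c(1, 2, 3, 4), post = c(2, 4, 6, 8))

# %$% exposes the columns of the left-hand side to the right-hand expression.
r_expo <- dat %$% cor(pre, post)

# with() does the same job in base R and also pipes cleanly,
# because dat is inserted as its first argument.
r_with <- dat %>% with(cor(pre, post))
```

Both forms let the function see pre and post by name, so the placeholder is not needed at all.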

Related

tidy eval with e.g. mtcars %>% mutate(target := log(target))

I figured this out while typing my question, but would like to see if there's a cleaner, less code way of doing what I want.
e.g. code block:
target <- "mpg"
# want
mtcars %>%
mutate(target := log(target))
I'd like to update mpg to be the log of mpg based on the variable target.
Looks like I got this working with:
mtcars %>%
mutate(!! rlang::sym(target) := log(!! rlang::sym(target)))
That just reads as pretty repetitive. Is there a 'cleaner', less code way of achieving the same result?
I'm fond of the double curly braces {{var}}, no reason, they are just nicer to read imho but I couldn't get the same results when I tried:
mtcars %>%
mutate(!! rlang::sym(target) := log({{target}}))
What are the various ways I can use tidyeval to mutate a field via transformation based on a pre determined variable to define which field to be transformed, in this case the variable 'target'?
On the lhs of :=, the string can be unquoted with just !!. On the rhs it is the column's values that we need, so we convert the string to a symbol and unquote that (!!).
library(dplyr)
mtcars %>%
mutate(!!target := log(!! rlang::sym(target)))
1) Use mutate_at
library(dplyr)
mtcars %>% mutate_at(target, log)
2) We can use the magrittr %<>% operator:
library(magrittr)
mtcars[[target]] %<>% log
3) Of course this is trivial in base R:
mtcars[[target]] <- log(mtcars[[target]])
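A few further spellings may read less repetitively. This sketch assumes dplyr 1.0 or later (for across() and glue-style names) and uses the .data pronoun to look a column up by string. As far as I can tell, {{ }} is intended for function arguments that carry a column expression, which is why it does not help with a plain string like target.

```r
library(dplyr)

target <- "mpg"

# .data[[target]] looks up the column named by the string in target.
out_pronoun <- mtcars %>% mutate(!!target := log(.data[[target]]))

# Glue-style name on the left-hand side instead of !!.
out_glue <- mtcars %>% mutate("{target}" := log(.data[[target]]))

# across() names the column only once.
out_across <- mtcars %>% mutate(across(all_of(target), log))
```

All three overwrite mpg in place and leave the other columns untouched.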

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write df %>%, but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
pull(height) %>%
mean(na.rm = TRUE) %>%
print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr; pull() itself comes from dplyr. The na.rm = TRUE argument matters for summary functions such as mean and sd, and for functions like min and max: if the data contain NA values and na.rm is left at its default of FALSE, they return NA.
df %>% pull(height) %>% mean(na.rm = TRUE) %>% print()
Unless your data is nested, you may not even need pull:
df %>% summarise(mean = mean(height, na.rm = TRUE))
Also, with summarise you can store the results in another data frame rather than just printing them, and pull values out of it whenever you want. (The result is named height_summary here to avoid masking base R's summary function.)
df %>% summarise(meanHt = mean(height, na.rm = TRUE), sdHt = sd(height, na.rm = TRUE)) -> height_summary
height_summary[1]
height_summary[2]
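The rule of thumb is to read the nested call from the inside out: the innermost call becomes the first step of the pipe, and each argument stays with the function it belonged to. A small sketch on made-up data with one missing value:

```r
library(dplyr)

df <- data.frame(height = c(170, 180, NA, 175))

# Nested form: pull() runs first, then mean(); na.rm belongs to mean().
nested <- mean(pull(df, height), na.rm = TRUE)

# Piped form: same steps, written in the order they run.
piped <- df %>% pull(height) %>% mean(na.rm = TRUE)

# Without na.rm = TRUE the missing value propagates.
unhandled <- df %>% pull(height) %>% mean()
```

Here nested and piped are both 175, while unhandled is NA.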

With dplyr and enquo my code works but not when I pass to purrr::map

I want to create a plot for each column in a vector called dates. My data frame contains only these columns and I want to group on it, count the occurrences and then plot it.
The code below works, except for the map call, which I want to use to go across a previously unknown number of columns. I think I'm using map correctly; I've had success with it before. I'm new to using quosures, but given that my function call works I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
stringsAsFactors = FALSE
)
dates<-names(df)
library(tidyverse)
dates.count<-function(.x){
group_by<-enquo(.x)
df %>% group_by(!!group_by) %>% summarise(count=n()) %>% ungroup() %>% ggplot() + geom_point(aes(y=count,x=!!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown
When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
group_by = ensym(.x)
df %>%
group_by(!!group_by) %>%
summarise(count=n()) %>%
ungroup() %>%
ggplot() +
geom_point(aes(y=count,x=!!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always do an anonymous function instead, but in your case where you want to map the column names to a function with a single argument you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))
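If the column names are always going to arrive as strings, one option is to skip quoting entirely and select the grouping column with across(all_of()). A minimal sketch, assuming dplyr 1.0 or later. The helper name count_by and the reduced data are made up for illustration, and the plotting step is left out to keep it short:

```r
library(dplyr)
library(purrr)

df <- data.frame(
  date1 = c("2018-01-01", "2018-01-01", "2018-01-02"),
  date2 = c("2018-01-01", "2018-01-02", "2018-01-02"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts rows per value of the column named by `col` (a string).
count_by <- function(data, col) {
  data %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop")
}

# One summary table per column name.
counts <- map(names(df), function(col) count_by(df, col))
```

Each element of counts is a small data frame with the date column and a count column, ready to be passed on to ggplot().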

Repeatedly mutate variable using dplyr and purrr

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_at.
My custom function
calculate_delta <- function(df, col) {
#generate variable name
newcolname = paste("d", col, sep="")
#get formula for first difference.
calculate_diff <- lazyeval::interp(~(a + lag(a))/a, a = as.name(col))
#pass formula to mutate, name new variable the columname generated above
df %>%
mutate_(.dots = setNames(list(calculate_diff), newcolname)) }
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although obviously the meaning of the result is non-sensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious whether there is a method of repeatedly transforming variables that accomplishes the following.
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as is the case when using dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes what I've described by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.
Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) /x
cols <- c("wt", "mpg")
#This
library(dplyr)
mtcars %>% mutate_at(cols, delta)
#Or, with purrr (note that map_at returns a list rather than a data frame)
library(purrr)
mtcars %>% map_at(cols, delta)
#If necessary, in a function
f <- function(df, cols) {
df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
Edit
If you would like to embed new names after, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
# match() keeps each new name aligned with its old name, whatever the column order
names(object)[match(old, names(object))] <- new
object
}
mtcars %>%
mutate_at(cols, delta) %>%
Rename(cols, paste0("lagged",cols))
Alternatively, to keep the original columns and add the lagged ones under suffixed names (wt_lagged, mpg_lagged):
mtcars %>% mutate_at(cols, list(lagged = delta))
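With dplyr 1.0 or later, across() with its .names argument appears to cover the first two goals from the question in one step: the original columns are kept and the new ones are named automatically. This sketch assumes a definition of percent change from the previous row, pct_change, which differs from the (x + lag(x)) / x formula used in the thread:

```r
library(dplyr)

# Assumed definition of "percent change from the previous row"; the formulas
# in the thread use (x + lag(x)) / x instead.
pct_change <- function(x) (x - dplyr::lag(x)) / dplyr::lag(x)

cols <- c("wt", "mpg")

# .names = "d{.col}" builds the new column names dwt and dmpg.
out <- mtcars %>%
  mutate(across(all_of(cols), pct_change, .names = "d{.col}"))
```

The first row of each new column is NA because there is no previous row to compare against.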

Dplyr or Magrittr - tolower?

Is it possible to set all column names to upper or lower within a dplyr or magrittr chain?
In the example below I load the data and then, using a magrittr pipe, chain it through to my dplyr mutations. In the 4th line I use the tolower function , but this is for a different purpose: to create a new variable with lowercase observations.
mydata <- read.csv('myfile.csv') %>%
mutate(Year = mdy_hms(DATE),
Reference = (REFNUM),
Event = tolower(EVENT))
I'm obviously looking for something like colnames = tolower but know this doesn't work/exist.
I note the dplyr rename function but this isn't really helpful.
In magrittr the colname options are:
set_colnames instead of base R's colnames<-
set_names instead of base R's names<-
I've tried numerous permutations with these but no dice.
Obviously this is very simple in base r.
names(mydata) <- tolower(names(mydata))
However it seems incongruous with the dplyr/magrittr philosophies that you'd have to do that as a clunky one liner, before moving on to an elegant chain of dplyr/magrittr code.
With {dplyr} we can do:
mydata %>% rename_all(tolower)
or, in dplyr 1.0.0 and later (where rename_all is superseded):
mydata %>% rename_with(tolower)
iris %>% setNames(tolower(names(.))) %>% head
Or equivalently use replacement function in non-replacement form:
iris %>% `names<-`(tolower(names(.))) %>% head
iris %>% `colnames<-`(tolower(names(.))) %>% head # if you really want to use `colnames<-`
Using magrittr's "compound assignment pipe-operator" %<>% might be, if I understand your question correctly, an even more succinct option.
library("magrittr")
names(iris) %<>% tolower
?`%<>%` # for more
mtcars %>%
set_colnames(value = casefold(colnames(.), upper = FALSE)) %>%
head
casefold is available in base R and can convert in both directions, i.e. to all upper case or all lower case, depending on the upper flag.
Also, colnames() ensures that only the column headers are used for the case conversion.
You could also define a function:
upcase <- function(df) {
names(df) <- toupper(names(df))
df
}
library(dplyr)
mtcars %>% upcase %>% select(MPG)
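For reference, a short sketch of how rename_with() fits into a chain, assuming dplyr 1.0 or later. It can also be limited to a subset of columns with a tidyselect helper:

```r
library(dplyr)

# Lower-case every column name inside the pipeline.
lower <- iris %>% rename_with(tolower)

# Upper-case only the columns whose names start with "Sepal".
upper_some <- iris %>% rename_with(toupper, starts_with("Sepal"))
```

After this, names(lower) starts with sepal.length, and upper_some has SEPAL.LENGTH and SEPAL.WIDTH while the remaining names are unchanged.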

Resources