This starts as an aesthetic question but then turns into a functional one, specifically about magrittr.
I want to add a manually entered data frame to one that already exists, like so:
cars_0 <- mtcars %>%
mutate(brand = row.names(.)) %>%
select(brand, mpg, cyl)
new_cars <- matrix(ncol = 3, byrow = T, c(
"VW Beetle", 25, 4,
"Peugeot 406", 42, 6)) # Coercing types is not an issue here.
cars_1 <- rbind(cars_0,
set_colnames(new_cars, names(cars_0)))
I'm writing the new cars in a matrix for "increased legibility", and therefore need to set the column names for it to be bound to cars_0.
If anyone likes magrittr as much as I do, they might want to present new_cars first and pipe it to set_colnames
cars_1 <- rbind(cars_0, new_cars %>%
set_colnames(names(cars_0)))
Or to avoid repetition they'll want to indicate cars_0 and pipe it to rbind
cars_1 <- cars_0 %>%
rbind(., set_colnames(new_cars, names(.)))
However, one cannot do both, as there is confusion about which object is being piped:
cars_1 <- cars_0 %>%
rbind(., new_cars %>% set_colnames(names(.)))
## Error in match.names(clabs, names(xi)) :
## names do not match previous names
My question: Is there a way to distinguish the two arguments that are piped?
Short answer: no.
Longer answer: I'm not sure what the rationale for doing this would be. The philosophy behind magrittr was to unnest composite functions, with the primary intent of making it easier to read the code. For example:
f(g(h(x)))
becomes
h(x) %>% g() %>% f()
Trying to use pipes in a manner that places two objects to be interpreted as the . argument goes against the philosophy of simplification. There are circumstances in which you can have nested pipes, but the environments ought to remain distinct. Trying to cross two pipes in the same environment can be likened to crossing the streams.
Don't cross the streams :)
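One workaround that keeps the two pipes apart, offered as a sketch rather than as a magrittr feature for telling placeholders apart: in the failing call the inner . is bound to new_cars, the left-hand side of the inner pipe, so names(.) no longer refers to cars_0. Piping cars_0 into an anonymous function gives the outer object its own name (d below is an arbitrary name chosen for this illustration), and the inner pipe keeps . to itself.

```r
library(magrittr)

# Same data as in the question, built without dplyr to keep the sketch small.
cars_0 <- data.frame(brand = rownames(mtcars), mpg = mtcars$mpg, cyl = mtcars$cyl)
new_cars <- matrix(ncol = 3, byrow = TRUE, c(
  "VW Beetle", 25, 4,
  "Peugeot 406", 42, 6))

# A parenthesised function on the right-hand side is applied to the left-hand side.
# `d` names the outer object, so `.` is only ever used by the inner pipe.
cars_1 <- cars_0 %>%
  (function(d) rbind(d, new_cars %>% set_colnames(names(d))))
```

The streams stay uncrossed: each pipe has exactly one object flowing through it, and the outer one is simply referred to by name.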
Related
In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.
Consider the following dataset.
library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
mutate(g2 = ifelse(gear == 2, 1, 0),
g3 = ifelse(gear == 3, 1, 0),
g4 = ifelse(gear == 4, 1, 0))
Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:
answer = mtcars %>%
summarise(sum_g2 = sum(g2),
sum_g3 = sum(g3),
sum_g4 = sum(g4))
But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer to my previous question, I would use:
columns_to_sum = c("g2","g3","g4")
formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")
answer = mtcars %>%
summarise(!!!parse_exprs(formulas))
If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.
Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).
In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired, as the rlang approach is now the intended one. The related questions I reviewed on Stack Overflow did not use the rlang quotation approach and hence are not sufficient for my purposes.
How do I summarise with a dynamic number of columns (using an rlang approach)?
One option since dplyr 1.0.0 could be:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
sum_g2 sum_g3 sum_g4
1 0 15 12
Your attempt gives the correct answer but does not give the column names you expect.
Here's an approach using map to get the names correct :
library(dplyr)
library(rlang)
library(purrr)
map_dfc(columns_to_sum, ~mtcars %>%
summarise(!!paste0('sum_', .x) := sum(!!sym(.x))))
# sum_g2 sum_g3 sum_g4
#1 0 15 12
You can also use this simple base R approach without any NSE-stuff :
setNames(data.frame(t(colSums(mtcars[columns_to_sum]))),
paste0('sum_', columns_to_sum))
And the same thing the dplyr way:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum)) %>%
set_names(paste0('sum_', columns_to_sum))
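For completeness, a minimal sketch of how the attempt from the question might be repaired while staying with the rlang approach. The likely cause of the naming problem is that each parsed string becomes a single unnamed expression (a call to =), so summarise labels the column with the whole expression text. Parsing only the right-hand sides and naming the resulting list lets !!! splice them in as named arguments. The cars copy of the data and the name sum_exprs are introduced here only to keep the example self-contained.

```r
library(dplyr)
library(rlang)

# Local copy of the example data from the question.
cars <- mtcars %>%
  mutate(g2 = ifelse(gear == 2, 1, 0),
         g3 = ifelse(gear == 3, 1, 0),
         g4 = ifelse(gear == 4, 1, 0))

columns_to_sum <- c("g2", "g3", "g4")

# Parse only the right-hand sides, then name the list of expressions.
sum_exprs <- parse_exprs(paste0("sum(", columns_to_sum, ")"))
names(sum_exprs) <- paste0("sum_", columns_to_sum)

# Splicing a named list supplies name = expression pairs to summarise().
answer <- cars %>% summarise(!!!sum_exprs)
```

The same named list can be spliced into mutate() in the same way.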
I am interested in passing a string not as an argument to a function but as the entire function call. This may not be the smartest approach, but I am curious and want to understand how dplyr and R interpret strings. Perhaps I am missing something very obvious, but here are my attempts:
#what i want----
library(dplyr)
mtcars %>% count()
#replicate by passing string as count---
#feed string as a function
my_string = "count()"
#attempt 1
mtcars %>% my_string
#attempt 2
mtcars %>% eval(noquote(my_string))
#neither of the attempts work
If this is not possible I understand, but it would be interesting if possible as I can see some applications for this in my mind.
EDIT
A little more to explain why I want to do this. I have worked with fst files for some time for some very large data and load data into my environment like so, often performing operations on one file at a time and in parallel which is very efficient for my purposes:
#pseudo code---
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x)), as.data.table = TRUE) %>%
#this portion turn into a string----
group_by(foo) %>%
count()
#------------------------------
}) %>% rbindlist()
#application-------
my_string = "group_by(foo) %>%
count()"
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x)), as.data.table = TRUE) %>% my_string
}) %>% rbindlist()
I use data.table more often, but I think dplyr might be better for this specific task. What I want to be able to do is write out the entire pipeline separately as a string and then pass it. This would allow me to write a package to shorten my workflow, or something to that effect.
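A minimal sketch of one way this could work, under the assumption that the string always holds valid R code from a trusted source (evaluating arbitrary text is risky otherwise). A string is not code until it is parsed, which is why piping into my_string fails. Pasting ". %>%" in front of the text and evaluating the parsed result yields a magrittr functional sequence, that is, an ordinary function that can sit at the end of a pipe. Here mtcars and its cyl column stand in for the fst data and foo, and my_pipeline is a name made up for this example.

```r
library(dplyr)

my_string <- "group_by(cyl) %>%
  count()"

# Parse ". %>% group_by(cyl) %>% count()" and evaluate it.
# A pipeline that starts with `.` evaluates to a function (a functional sequence).
my_pipeline <- eval(parse(text = paste(". %>%", my_string)))

# The resulting function can be used like any other pipeline step.
result <- mtcars %>% my_pipeline()
```

Inside the pblapply call, the last step would then be my_pipeline() rather than my_string.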
I'm trying to write a function that adjusts the grouping vars to exclude a single grouping var. The function is always passed a grouped tibble. The first part of the function does some calculations at the grouping level it's supplied. The second part does additional calculations, but needs to exclude a single grouping var that's dynamic in my data. Using the mtcars as a sample dataset:
library(tidyverse)
# x is a grouped tibble, my_col is the column to peel
my_function <- function(x, my_col){
my_col_enc <- enquo(my_col)
# Trying to grab the groups and then peel off the column
x_grp <- x %>% group_vars()
excluded <- x_grp[!is.element(x_grp, as.character(my_col_enc))]
# My calculations are two-tiered as described in the original description
# simplifying for example
x %>% group_by(excluded) %>% tally()
}
# This should be equivalent to mtcars %>% group_by(gear) %>% tally()
mtcars %>% group_by(cyl, gear) %>% my_function(cyl)
When I run this, I get an Error: Column 'excluded' is unknown.
Edit:
For any future searchers with this issue, if you have a character vector (i.e. multiple grouping vars), you may need to use syms with !!! to achieve what my original question was asking for.
Here's what you're looking for:
library(tidyverse)
my_function <- function(x, my_col){
my_col_enc <- enquo(my_col)
# Trying to grab the groups and then peel off the column
x_grp <- x %>% group_vars()
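# here, make sure this is a symbol, else it'll group as character later (e.g. 'gear')
# as_name() extracts the column name from the quosure; as.character() on a quosure is deprecated in recent versions of rlang
excluded <- rlang::sym(x_grp[!is.element(x_grp, rlang::as_name(my_col_enc))])
# need to use !! to unquote the symbol
x %>% group_by(!!excluded) %>% tally()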
}
I commented the code, but your first problem was that your excluded variable wasn't recognized: to make indirect references to columns, it is necessary to modify the quoted code before it gets evaluated. Do this with the !! (pronounced 'bang bang') operator.
Adding just that to your code won't completely solve it, because excluded is a character. It needs to be treated as a symbol, hence the rlang::sym() function wrapping its declaration.
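Following up on the edit in the question, here is a sketch of the multi-column variant, assuming that more than one grouping variable may remain after peeling one off. rlang::sym() accepts only a single string, so syms() with !!! is used instead. drop_group is a hypothetical name for this illustration.

```r
library(dplyr)
library(rlang)

# x is a grouped tibble, my_col is the grouping column to peel off.
drop_group <- function(x, my_col) {
  drop_name <- as_name(enquo(my_col))
  # Keep every grouping variable except the one being dropped.
  keep <- setdiff(group_vars(x), drop_name)
  # syms() turns the character vector into a list of symbols; !!! splices them in.
  x %>% group_by(!!!syms(keep)) %>% tally()
}

# Should match mtcars %>% group_by(gear, am) %>% tally()
res <- mtcars %>% group_by(cyl, gear, am) %>% drop_group(cyl)
```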
I am trying to rewrite this expression using magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as df %>%, but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
  pull(height) %>%
  mean(na.rm = TRUE) %>%
  print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- seq(1:30)
weight <- seq(1:30)
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr; pull(), for instance, comes from dplyr. Summary functions such as mean, sd, min, and max return NA as soon as their input contains an NA, so na.rm=T is needed whenever the data may contain missing values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested, you may not even need to use pull:
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can store the results in another data frame rather than just printing them, and extract them from that data frame whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]
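To make the effect of na.rm concrete, here is a small sketch on made-up data (the heights below are invented for illustration and are not the asker's dataset):

```r
library(dplyr)

# Hypothetical data with one missing height.
df <- data.frame(height = c(170, NA, 180), weight = c(65, 70, 80))

df %>% pull(height) %>% mean()              # NA, because one value is missing
df %>% pull(height) %>% mean(na.rm = TRUE)  # 175, the NA is dropped first

mean_height <- df %>% pull(height) %>% mean(na.rm = TRUE)
```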
I recently discovered the pipe operator %>%, which can make code more readable. Here is my MWE.
library(dplyr) # for the pipe operator
library(lsr) # for the cohensD function
set.seed(4) # make it reproducible
dat <- data.frame( # create data frame
subj = c(1:6),
pre = sample(1:6, replace = TRUE),
post = sample(1:6, replace = TRUE)
)
dat %>% select(pre, post) %>% sapply(., mean) # works as expected
However, I struggle using the pipe operator in this particular case
dat %>% select(pre, post) %>% cohensD(.$pre, .$post) # piping returns an error
cohensD(dat$pre, dat$post) # classical way works fine
Why is it not possible to subset columns using the placeholder . in combination with $? Is it worthwhile to write this line using a pipe operator %>%, or does it complicate syntax? The classical way of writing this seems more concise.
This would work:
dat %>% select(pre, post) %>% {cohensD(.$pre, .$post)}
Wrapping the last call in curly braces makes magrittr treat it as an expression rather than a function call. When you pipe something into an expression, the . gets replaced as expected. I often use this trick to call a function which does not interface well with piping.
What is inside the braces happens to be a function call but could really be any expression of . .
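What the braces change can be seen with a toy function in place of cohensD (count_args is a hypothetical helper that only reports how many arguments it received, and the data is made up). Without braces, magrittr still inserts the left-hand side as the first argument when . appears only inside nested calls such as .$pre; with braces, nothing is inserted.

```r
library(magrittr)

dat <- data.frame(pre = c(1, 2, 3), post = c(2, 4, 6))

# Hypothetical helper: returns the number of arguments it was called with.
count_args <- function(...) length(list(...))

# Called as count_args(., .$pre, .$post): three arguments.
without_braces <- dat %>% count_args(.$pre, .$post)

# Called exactly as written: two arguments.
with_braces <- dat %>% { count_args(.$pre, .$post) }
```

This is why the unbraced cohensD call receives the whole data frame as an unexpected extra first argument.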
Since you're going from a bunch of data to one (row of) value(s), you're summarizing. In a dplyr pipeline you can use the summarize function for this; within summarize you don't need to subset and can just refer to pre and post directly.
Like so:
dat %>% select(pre, post) %>% summarize(CD = cohensD(pre, post))
(The select statement isn't actually necessary in this case, but I left it in to show how this works in a pipeline)
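It doesn't work because the . placeholder is only used inside nested calls (.$pre and .$post), not directly as an argument. In that case magrittr still inserts the left-hand side as the first argument, so the call effectively becomes cohensD(., .$pre, .$post).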
If you really want to use piping, you can do it with the formula interface, but with a little reshaping before (melt is from reshape2 package):
dat %>% select(pre, post) %>% melt %>% cohensD(value~variable, .)
#### [1] 0.8115027