Recently I found the %$% pipe operator, but I do not understand how it differs from %>% or whether it could completely replace it.
Motivation to use %$%
The operator %$% could replace %>% in many cases:
mtcars %>% summary()
mtcars %$% summary(.)
mtcars %>% head(10)
mtcars %$% head(.,10)
Apparently, %$% is more usable than %>%:
mtcars %>% plot(.$hp, .$mpg) # Does not work
mtcars %$% plot(hp, mpg) # Works
Implicitly fills the built-in data argument:
mtcars %>% lm(mpg ~ hp, data = .)
mtcars %$% lm(mpg ~ hp)
Since % and $ are next to each other on the keyboard, typing %$% is more convenient than typing %>%.
Documentation
We find the following information in their respective help pages.
(?magrittr::`%>%`):
Description:
Pipe an object forward into a function or call expression.
Usage:
lhs %>% rhs
(?magrittr::`%$%`):
Description:
Expose the names in ‘lhs’ to the ‘rhs’ expression. This is useful
when functions do not have a built-in data argument.
Usage:
lhs %$% rhs
I was not able to understand the difference between the two pipe operators. What is the difference between piping an object and exposing its names? And in the rhs of %$% we can still get the piped object with the ., right?
Should I start using %$% instead of %>%? Which problems could I face doing so?
In addition to the provided comments:
%$% also called the Exposition pipe vs. %>%:
This is a short summary of this article https://towardsdatascience.com/3-lesser-known-pipe-operators-in-tidyverse-111d3411803a
"The key difference in using %$% or %>% lies in the type of arguments of used functions."
One advantage, and as far as I can tell the only reason to use %$% over %>%, is that
we can avoid repeating the data frame name in functions that have no data argument.
For example, lm() has a data argument. In this case we can use %>% (with data = .) and %$% interchangeably.
But in functions like the cor() which has no data argument:
mtcars %>% cor(disp, mpg) # Will give an Error
cor(mtcars$disp, mtcars$mpg)
is equivalent to
mtcars %$% cor(disp, mpg)
Note that to use the %$% pipe operator you have to load magrittr with library(magrittr).
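As a small check of the equivalence above, the sketch below computes the same correlation three ways. The `with()` form is added here only for comparison, since `%$%` behaves much like it; the variable names `r_dollar`, `r_expose` and `r_with` are made up for illustration.

```r
library(magrittr)

# Base R: repeat the data frame name for every column
r_dollar <- cor(mtcars$disp, mtcars$mpg)

# Exposition pipe: columns of the LHS are visible by name in the RHS
r_expose <- mtcars %$% cor(disp, mpg)

# with() evaluates the expression with the columns in scope
r_with <- with(mtcars, cor(disp, mpg))

# All three should agree
isTRUE(all.equal(r_dollar, r_expose))
isTRUE(all.equal(r_dollar, r_with))
```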
Update: on OPs comment:
Pipes, regardless of which one you use, let us write code that reads more like human language than machine language.
ggplot2 is special. ggplot2 is not internally consistent.
ggplot1 had a tidier API than ggplot2.
Pipes would work with ggplot1:
library(ggplot1)
mtcars %>%
  ggplot(list(x = mpg, y = wt)) %>%
  ggpoint() %>%
  ggsave("mtcars.pdf", width = 8, height = 6)
In 2016 Hadley Wickham said:
"ggplot2 never would have existed if I'd discovered the pipe 10 years earlier!"
https://www.youtube.com/watch?v=K-ss_ag2k9E&list=LL&index=9
No, you shouldn't use %$% routinely. It is like using the with() function, i.e. it exposes the component parts of the LHS when evaluating the RHS. But it only works when the value on the left has names like a list or dataframe, so you can't always use it. For example,
library(magrittr)
x <- 1:10
x %>% mean()
#> [1] 5.5
x %$% mean()
#> Error in eval(substitute(expr), data, enclos = parent.frame()): numeric 'envir' arg not of length one
Created on 2022-02-06 by the reprex package (v2.0.1.9000)
You'd get a similar error with x %$% mean(.).
Even when the LHS has names, it doesn't automatically put the . argument in the first position. For example,
mtcars %>% nrow()
#> [1] 32
mtcars %$% nrow()
#> Error in nrow(): argument "x" is missing, with no default
In this case mtcars %$% nrow(.) would work, because mtcars has names.
Your example involving .$hp and .$mpg is illustrating one of the oddities of magrittr pipes. Because the . is only used in expressions, not alone as an argument, it is passed as the first argument as well as being passed in those expressions. You can avoid this using braces, e.g.
mtcars %>% {plot(.$hp, .$mpg)}
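To make the first-argument rule visible without drawing a plot, here is a rough sketch using a hypothetical helper, `count_args()`, which is not part of magrittr and simply reports how many arguments it received.

```r
library(magrittr)

# Hypothetical helper for illustration: counts the arguments it was called with
count_args <- function(...) length(list(...))

# The dot appears only inside nested calls (.$hp, .$mpg),
# so %>% still inserts mtcars as the first argument: 3 arguments
n_plain <- mtcars %>% count_args(.$hp, .$mpg)

# Braces switch off the automatic first-argument insertion: 2 arguments
n_braced <- mtcars %>% { count_args(.$hp, .$mpg) }

# %$% never inserts the LHS; it only exposes the column names: 2 arguments
n_exposed <- mtcars %$% count_args(hp, mpg)

c(plain = n_plain, braced = n_braced, exposed = n_exposed)
```

This is why plot(.$hp, .$mpg) after %>% ends up being called as plot(mtcars, mtcars$hp, mtcars$mpg) unless braces are used.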
Related
According to the documentation of the dplyr package:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% mutate_if(is.factor, as.character)
So how do I use the inverse form? I would like to transform all non-numeric values to characters, so I thought of doing:
iris %>% mutate_if(!is.numeric, as.character)
#> Error in !is.numeric : invalid argument type
But that doesn't work. Or just select all variables that are not numeric:
iris %>% select_if(!is.numeric)
#> Error in !is.numeric : invalid argument type
Doesn't work either.
How do I use negation with dplyr functions like mutate_if(), select_if() and arrange_if()?
EDIT: This might be solved in the upcoming dplyr 1.0.0: NEWS.md.
We can use shorthand notation ~ for anonymous function in tidyverse
library(dplyr)
iris %>%
mutate_if(~ !is.numeric(.), as.character)
Or without anonymous function, use negate from purrr
library(purrr)
iris %>%
mutate_if(negate(is.numeric), as.character)
In addition to negate, Negate from base R also works
iris %>%
mutate_if(Negate(is.numeric), as.character)
The same notation works with select_if()/arrange_if():
iris %>%
select_if(negate(is.numeric))%>%
head(2)
# Species
#1 setosa
#2 setosa
Could be a nice suggestion to add to their package, so feel free to open an issue on GitHub.
For now, you can write a function 'on-the-fly':
iris %>% mutate_if(function(x) !is.numeric(x), as.character)
iris %>% select_if(function(x) !is.numeric(x))
And this might even be safer, not sure how the _if() internals work:
iris %>% mutate_if(function(...) !is.numeric(...), as.character)
iris %>% select_if(function(...) !is.numeric(...))
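For readers on a newer dplyr, the following is a sketch of the same negation, assuming dplyr 1.0.0 or later where `across()` and `where()` are available. The object names `iris_chr` and `iris_non_numeric` are made up for this example.

```r
library(dplyr)

# Convert every non-numeric column to character
iris_chr <- iris %>%
  mutate(across(where(~ !is.numeric(.x)), as.character))

# Keep only the non-numeric columns; ! negates the where() selection
iris_non_numeric <- iris %>%
  select(!where(is.numeric))

class(iris_chr$Species)
names(iris_non_numeric)
```

The scoped _if() verbs still work, so this is an alternative rather than a required change.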
I figured this out while typing my question, but would like to see if there's a cleaner, less code way of doing what I want.
e.g. code block:
target <- "mpg"
# want
mtcars %>%
mutate(target := log(target))
I'd like to update mpg to be the log of mpg based on the variable target.
Looks like I got this working with:
mtcars %>%
mutate(!! rlang::sym(target) := log(!! rlang::sym(target)))
That just reads as pretty repetitive. Is there a 'cleaner', less code way of achieving the same result?
I'm fond of the double curly braces {{var}}, no reason, they are just nicer to read imho but I couldn't get the same results when I tried:
mtcars %>%
mutate(!! rlang::sym(target) := log({{target}}))
What are the various ways I can use tidyeval to mutate a field via transformation based on a pre determined variable to define which field to be transformed, in this case the variable 'target'?
On the LHS of :=, the string can be unquoted with just !!, while on the RHS it is the column's value that we need, so we convert the string to a symbol and then unquote it (!!).
library(dplyr)
mtcars %>%
mutate(!!target := log(!! rlang::sym(target)))
1) Use mutate_at
library(dplyr)
mtcars %>% mutate_at(target, log)
2) We can use the magrittr %<>% operator:
library(magrittr)
mtcars[[target]] %<>% log
3) Of course this is trivial in base R:
mtcars[[target]] <- log(mtcars[[target]])
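A few further variants, offered as a sketch that assumes dplyr 1.0.0 or later: the `.data` pronoun and `across()` with `all_of()` both accept the column name as a string, while `{{ }}` is meant for wrapping a function argument rather than a string variable. The function name `log_col` is hypothetical.

```r
library(dplyr)

target <- "mpg"

# .data[[target]] looks the column up by its string name
res_data <- mtcars %>%
  mutate(!!target := log(.data[[target]]))

# across() + all_of() applies log to the column(s) named in target
res_across <- mtcars %>%
  mutate(across(all_of(target), log))

# {{ }} works when the column is passed as a function argument
log_col <- function(data, col) {
  data %>% mutate({{ col }} := log({{ col }}))
}
res_fun <- log_col(mtcars, mpg)

head(res_data$mpg, 3)
```

The across() form avoids naming the target twice, which is the repetition the question was trying to get rid of.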
In R for Data Science, Chapter 21.5.1, this syntax is used with the base function split(): split(.$cyl). Why is there a dot in .$cyl? The purrr package has a syntax for placeholders (. or .x), but purrr is not involved here.
library(tidyverse)
mtcars %>% split(f=.$cyl)
The placeholder syntax used by purrr is also used by the magrittr pipe (%>%). By default, the pipe passes the left-hand side (LHS) as the first argument of the function on the right-hand side (RHS). When this is the case the . is not necessary in the RHS expression.
For instance:
mtcars %>% str()
works fine and is equivalent to:
mtcars %>% str(.)
The . is in this case totally unnecessary because the LHS (mtcars) is the first argument passed to str().
So this is the same as:
str(mtcars)
But in any other situation, you need to use . to mark where, in the RHS, the LHS should be passed.
Your example is a little complex because the LHS (mtcars) is passed twice in the RHS (the function split()):
first, as the first argument (so no . needed)
then, again, as part of the 2nd argument (so you do need a . in that case).
mtcars %>% split(f = .$cyl)
could be written (though that is unnecessary) as:
mtcars %>% split(x = ., f = .$cyl)
and is thus in fact equivalent to:
split(x = mtcars, f = mtcars$cyl)
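The equivalence can be checked directly. This is a minimal sketch; the names `piped` and `plain` are chosen only for this example.

```r
library(magrittr)

# Piped form: mtcars becomes the first argument, and . refers to it again in f
piped <- mtcars %>% split(f = .$cyl)

# Plain form with both arguments written out
plain <- split(x = mtcars, f = mtcars$cyl)

identical(piped, plain)
names(piped)  # one data frame per value of cyl
```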
This is probably a simple question, but I'm having trouble getting the mean function to work using dplyr.
Using the mtcars dataset as an example, if I type:
data(mtcars)
mtcars %>%
select (mpg) %>%
mean()
I get the "Warning message:
In mean.default(.) : argument is not numeric or logical: returning NA" error message.
For some reason though if I repeat the same code but just ask for a "summary", or "range" or several other statistical calculations, they work fine:
data(mtcars)
mtcars %>%
select (mpg) %>%
summary()
Similarly, if I run the mean function in base R notation, that works fine too:
mean(mtcars$mpg)
Can anyone point out what I've done wrong?
Use pull to pull out the vector.
mtcars %>%
pull(mpg) %>%
mean()
# [1] 20.09062
Or use pluck from the purrr package.
mtcars %>%
purrr::pluck("mpg") %>%
mean()
# [1] 20.09062
Or summarize first and then pull out the mean.
mtcars %>%
summarize(mean = mean(mpg)) %>%
pull(mean)
# [1] 20.09062
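The reason the original attempt warns is that select() returns a data frame, even with a single column, while mean() expects a vector. A short sketch of the difference follows, with `%$%` from magrittr shown as one more option; the variable names are illustrative.

```r
library(dplyr)
library(magrittr)

# select() keeps the data frame structure
as_df <- mtcars %>% select(mpg)
is.data.frame(as_df)

# pull() extracts the column as a plain numeric vector
as_vec <- mtcars %>% pull(mpg)
is.numeric(as_vec)

# mean() works on the vector; %$% exposes mpg directly to mean()
m_pull <- mean(as_vec)
m_expose <- mtcars %$% mean(mpg)
isTRUE(all.equal(m_pull, m_expose))
```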
In dplyr, you can use summarise() whenever you're not changing your original dataframe (reordering it, filtering it, adding to it, etc), but instead are creating a new dataframe that has summary statistics for the first dataframe.
mtcars %>%
summarise(mean_mpg = mean(mpg))
gives the output:
mean_mpg
1 20.09062
PS. If you're learning dplyr, learning these five verbs will take you a long way: select(), filter(), group_by(), summarise(), arrange().
I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but not to plot(). I came across this post, but the answers, including the one in the comments, didn't work for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?
plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
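Since the asker's mynei data is not available, here is a self-contained sketch with a made-up stand-in, `mynei_toy`, whose values are invented purely for illustration. It shows the with() work-around and the formula interface side by side.

```r
library(dplyr)

# Toy stand-in for the asker's data (values are invented)
mynei_toy <- data.frame(
  year      = c(1999, 1999, 2002, 2002, 2005),
  Emissions = c(10, 5, 8, 4, 6)
)

emission <- mynei_toy %>%
  group_by(year) %>%
  summarise(total = sum(Emissions))

# No data argument in plot.default: evaluate the call inside the data
emission %>% with(plot(year, total))

# plot.formula has a data argument, but not first, so place the dot explicitly
emission %>% plot(total ~ year, data = .)
```

Both calls should draw the same three points; the formula version labels the axes from the formula terms.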
As an alternative, you can use the %$% operator from magrittr to be able to access the columns of a dataframe directly. For example:
iris %$%
plot(Sepal.Length~Sepal.Width)
This is often useful when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)