An earlier question I had (with answer) illustrates how the {} wrapper prevents piping to the first possible argument. Now I'm playing with this idea in the following manner.
# this all works
library(tidyverse)
mt <- mtcars %>% count(cyl)
seq_along(mt$cyl)
The code chunk above works. Neither of the two below does. I get an error, "Error in function_list[k] : object 'cyl' not found". What did I do wrong this time?
# does not work
mtcars %>%
count(cyl) %>%
{seq_along(cyl)}
#does not work
mtcars %>%
count(cyl) %>%
seq_along(cyl)
If none of my stuff makes sense all I really need is the simplest example of how the {} wrapper works with dplyr. Thank you.
You would need
mtcars %>%
count(cyl) %>%
{seq_along(.$cyl)}
The object is still passed as . inside the braces, but it is not automatically inserted into the first parameter.
In your first case
mtcars %>%
count(cyl) %>%
{seq_along(cyl)}
is the same as these two separate commands
count(mtcars, cyl)
seq_along(cyl)
because you never actually use anything from the chain. And your second case
mtcars %>%
count(cyl) %>%
seq_along(cyl)
is the same as
seq_along(count(mtcars, cyl), cyl)
which doesn't work because seq_along isn't a tidy-friendly function in that it doesn't accept a data source as the first parameter.
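The cases above can be put side by side in a minimal sketch, assuming dplyr and magrittr are installed. The names counts, n_rows, idx and idx2 are illustrative only and do not come from the original question.

```r
library(dplyr)     # provides count(), pull() and re-exports %>%
library(magrittr)

counts <- mtcars %>% count(cyl)

# No braces: the piped value is inserted as the first argument,
# so this is the same as nrow(counts).
n_rows <- counts %>% nrow()

# Braces: nothing is inserted automatically, so the piped value
# has to be referred to explicitly as `.`.
idx <- mtcars %>%
  count(cyl) %>%
  { seq_along(.$cyl) }

# Alternative without braces: extract the column first,
# then pipe the resulting vector into seq_along().
idx2 <- mtcars %>%
  count(cyl) %>%
  pull(cyl) %>%
  seq_along()
```

Both idx and idx2 hold one index per distinct cyl value. The pull() variant avoids braces entirely, which some readers may find easier to follow.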
Recently I have found the %$% pipe operator, but I am missing the point regarding its difference with %>% and if it could completely replace it.
Motivation to use %$%
The operator %$% could replace %>% in many cases:
mtcars %>% summary()
mtcars %$% summary(.)
mtcars %>% head(10)
mtcars %$% head(.,10)
Apparently, %$% is more usable than %>%:
mtcars %>% plot(.$hp, .$mpg) # Does not work
mtcars %$% plot(hp, mpg) # Works
Implicitly fills the built-in data argument:
mtcars %>% lm(mpg ~ hp, data = .)
mtcars %$% lm(mpg ~ hp)
Since % and $ are next to each other on the keyboard, typing %$% is more convenient than typing %>%.
Documentation
We find the following information in their respective help pages.
(?magrittr::`%>%`):
Description:
Pipe an object forward into a function or call expression.
Usage:
lhs %>% rhs
(?magrittr::`%$%`):
Description:
Expose the names in ‘lhs’ to the ‘rhs’ expression. This is useful
when functions do not have a built-in data argument.
Usage:
lhs %$% rhs
I was not able to understand the difference between the two pipe operators. What is the difference between piping an object and exposing a name? And in the rhs of %$%, we are still able to get the piped object with the ., right?
Should I start using %$% instead of %>%? Which problems could I face doing so?
In addition to the provided comments:
%$%, also called the exposition pipe, vs. %>%:
This is a short summary of this article: https://towardsdatascience.com/3-lesser-known-pipe-operators-in-tidyverse-111d3411803a
"The key difference in using %$% or %>% lies in the type of arguments of used functions."
One advantage of %$% over %>%, and as far as I can tell the only one, is that
we can avoid repeating the data frame name in functions that have no data argument.
For example, lm() has a data argument. In this case we can use either %>% or %$%.
But consider a function like cor(), which has no data argument:
mtcars %>% cor(disp, mpg) # Will give an Error
cor(mtcars$disp, mtcars$mpg)
is equivalent to
mtcars %$% cor(disp, mpg)
Note that to use the %$% pipe operator you have to load magrittr with library(magrittr).
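As a small sketch of what "exposing names" means, the three calls below should give the same correlation. The variable names r_pipe, r_with and r_base are illustrative only.

```r
library(magrittr)

# %$% exposes the columns of the left-hand side to the right-hand expression.
r_pipe <- mtcars %$% cor(disp, mpg)

# with() does the same job in base R.
r_with <- with(mtcars, cor(disp, mpg))

# Spelling out the data frame name gives the same number again.
r_base <- cor(mtcars$disp, mtcars$mpg)
```

Seen this way, %$% is a pipe-shaped with() rather than a general replacement for %>%.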
Update, in response to OP's comment:
Whichever pipe you use, it lets you turn machine-oriented code into something closer to readable human language.
ggplot2 is special. ggplot2 is not internally consistent.
ggplot1 had a tidier API than ggplot2.
Pipes would work with ggplot1:
library(ggplot1)
mtcars %>%
  ggplot(list(x = mpg, y = wt)) %>%
  ggpoint() %>%
  ggsave("mtcars.pdf", width = 8, height = 6)
In 2016 Hadley Wickham said:
"ggplot2 never would have existed if I'd discovered the pipe 10 years earlier!"
https://www.youtube.com/watch?v=K-ss_ag2k9E&list=LL&index=9
No, you shouldn't use %$% routinely. It is like using the with() function, i.e. it exposes the component parts of the LHS when evaluating the RHS. But it only works when the value on the left has names like a list or dataframe, so you can't always use it. For example,
library(magrittr)
x <- 1:10
x %>% mean()
#> [1] 5.5
x %$% mean()
#> Error in eval(substitute(expr), data, enclos = parent.frame()): numeric 'envir' arg not of length one
Created on 2022-02-06 by the reprex package (v2.0.1.9000)
You'd get a similar error with x %$% mean(.).
Even when the LHS has names, it doesn't automatically put the . argument in the first position. For example,
mtcars %>% nrow()
#> [1] 32
mtcars %$% nrow()
#> Error in nrow(): argument "x" is missing, with no default
Created on 2022-02-06 by the reprex package (v2.0.1.9000)
In this case mtcars %$% nrow(.) would work, because mtcars has names.
Your example involving .$hp and .$mpg is illustrating one of the oddities of magrittr pipes. Because the . is only used in expressions, not alone as an argument, it is passed as the first argument as well as being passed in those expressions. You can avoid this using braces, e.g.
mtcars %>% {plot(.$hp, .$mpg)}
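To see the extra first argument without drawing a plot, here is a minimal sketch using a hypothetical helper, count_args(), that only reports how many arguments it received. It is not part of magrittr.

```r
library(magrittr)

# Hypothetical helper: returns the number of arguments it was called with.
count_args <- function(...) length(list(...))

# The dot appears only inside nested expressions, so the data frame is
# ALSO inserted as the first argument: count_args(mtcars, hp, mpg).
n_plain <- mtcars %>% count_args(.$hp, .$mpg)

# Braces switch off the automatic insertion: count_args(hp, mpg).
n_braced <- mtcars %>% { count_args(.$hp, .$mpg) }
```

Here n_plain is 3 and n_braced is 2, which is why plot() receives an unexpected data frame in the version without braces.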
I figured this out while typing my question, but would like to see if there's a cleaner, less code way of doing what I want.
e.g. code block:
target <- "mpg"
# want
mtcars %>%
mutate(target := log(target))
I'd like to update mpg to be the log of mpg based on the variable target.
Looks like I got this working with:
mtcars %>%
mutate(!! rlang::sym(target) := log(!! rlang::sym(target)))
That just reads as pretty repetitive. Is there a 'cleaner', less code way of achieving the same result?
I'm fond of the double curly braces {{var}}, no reason, they are just nicer to read imho but I couldn't get the same results when I tried:
mtcars %>%
mutate(!! rlang::sym(target) := log({{target}}))
What are the various ways I can use tidyeval to mutate a field via transformation based on a pre determined variable to define which field to be transformed, in this case the variable 'target'?
On the lhs of :=, the string can be unquoted with just !!, while on the rhs it is the column's values that we need, so we convert the string to a symbol and then unquote it with !!.
library(dplyr)
mtcars %>%
mutate(!!target := log(!! rlang::sym(target)))
1) Use mutate_at
library(dplyr)
mtcars %>% mutate_at(target, log)
2) We can use the magrittr %<>% operator:
library(magrittr)
mtcars[[target]] %<>% log
3) Of course this is trivial in base R:
mtcars[[target]] <- log(mtcars[[target]])
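Two further options, sketched here on the assumption that dplyr 1.0 or later is installed: across() with all_of() selects the column by its string name, and the .data pronoun together with a glue-style name on the left of := avoids rlang::sym() altogether. The names by_across and by_data are illustrative only.

```r
library(dplyr)

target <- "mpg"

# across() + all_of(): apply log() to the column named in `target`.
by_across <- mtcars %>%
  mutate(across(all_of(target), log))

# .data pronoun on the rhs, glue-style name on the lhs of :=.
by_data <- mtcars %>%
  mutate("{target}" := log(.data[[target]]))
```

Both results overwrite mpg with its log and leave the other columns alone. The across() form extends naturally when target holds several column names.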
I have been researching online how to write functions in R, but I am still having a hard time figuring it out. Please help.
My initial code looks like:
whatever %>%
group_by(a) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=a, y=count)) +
geom_point()
I want to repeat this multiple times since there are other columns I want to check with the same function.
So I wrote:
point_dist <- function(dta, vari) {
dta %>%
group_by(vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=vari, y=count)) +
geom_point()
}
point_dist(whatever, a)
but R keeps telling me:
Error in eval_bare(sym, env) : object 'a' not found
I don't know why.
I also don't know whether this is the right direction to go.
Thanks again.
Your issue is related to non-standard evaluation that dplyr functions tend to give you. When you reference a in your first call to point_dist, R attempts to evaluate it, which of course fails. (It's even more confusing when you have some variable named as such in your calling environment or higher ...)
NSE in dplyr means you can do something like select(mtcars, cyl), whereas with most standard-evaluation functions, you'll need myfunc(mtcars, "cyl"), since there isn't a variable named cyl in the calling environment.
In your case, try:
point_dist <- function(dta, vari) {
vari <- enquo(vari)
dta %>%
group_by(!!vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=!!vari, y=count)) +
geom_point()
}
This method of dealing with unquoted column-names in your functions can be confusing if you're familiar with normal R function definitions and/or are not familiar with NSE. This can be a good template for you if that's as far as you're going to go with it, otherwise I strongly urge you to read a little more at the first reference below.
Some good references for NSE, specifically in/around tidyverse stuff:
https://dplyr.tidyverse.org/articles/programming.html
http://adv-r.had.co.nz/Computing-on-the-language.html
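With more recent versions of rlang (0.4.0 or later) and ggplot2 (3.0.0 or later), the enquo() and !! pair can usually be written with the embrace operator {{ }} instead. The sketch below assumes those versions and uses mtcars as stand-in data. collect() is left out because it only matters for remote tables.

```r
library(dplyr)
library(ggplot2)

point_dist <- function(dta, vari) {
  dta %>%
    group_by({{ vari }}) %>%          # {{ }} forwards the unquoted column name
    summarize(count = n()) %>%
    ggplot(aes(x = {{ vari }}, y = count)) +
    geom_point()
}

# Usage: pass the column name unquoted, as with any dplyr verb.
p <- point_dist(mtcars, cyl)
```

Calling point_dist(mtcars, gear) or any other column then reuses the same summary and plot.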
If you are summarising your data and piping to ggplot, then you don't need to use collect().
df <- data.frame(group=sample(letters[1:10],1000,T))
df %>% group_by(group) %>% summarise(n=n()) %>%
ggplot(aes(group,n)) + geom_point()
If you are going to apply this summary and plot method to multiple columns, I would suggest trying gather() and then plotting everything at once using + facet_wrap() and bar plots.
df <- data.frame(matrix(sample(letters[1:10],10000,T),ncol = 10))
df %>% gather(k,v) %>% group_by(k,v) %>% summarise(n=n()) %>%
ggplot(aes(k,n,fill=v)) + geom_bar(stat='identity') +
facet_wrap(~v) + theme(legend.position = 'none')
This is probably a simple question, but I'm having trouble getting the mean function to work using dplyr.
Using the mtcars dataset as an example, if I type:
data(mtcars)
mtcars %>%
select (mpg) %>%
mean()
I get the "Warning message:
In mean.default(.) : argument is not numeric or logical: returning NA" error message.
For some reason though if I repeat the same code but just ask for a "summary", or "range" or several other statistical calculations, they work fine:
data(mtcars)
mtcars %>%
select (mpg) %>%
summary()
Similarly, if I run the mean function in base R notation, that works fine too:
mean(mtcars$mpg)
Can anyone point out what I've done wrong?
Use pull to pull out the vector.
mtcars %>%
pull(mpg) %>%
mean()
# [1] 20.09062
Or use pluck from the purrr package.
mtcars %>%
purrr::pluck("mpg") %>%
mean()
# [1] 20.09062
Or summarize first and then pull out the mean.
mtcars %>%
summarize(mean = mean(mpg)) %>%
pull(mean)
# [1] 20.09062
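The reason behind the warning can be checked directly: select() always returns a data frame, even for a single column, while pull() returns the underlying vector. A minimal sketch, with illustrative names one_col and vec:

```r
library(dplyr)

one_col <- mtcars %>% select(mpg)   # still a data frame, with one column
vec     <- mtcars %>% pull(mpg)     # a plain numeric vector

is.data.frame(one_col)   # TRUE, so mean() falls back to mean.default() and warns
is.numeric(vec)          # TRUE, so mean() works as expected
mean(vec)
```

summary() works on the select() result because it has a method for data frames, whereas mean() does not.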
In dplyr, you can use summarise() whenever you're not changing your original dataframe (reordering it, filtering it, adding to it, etc), but instead are creating a new dataframe that has summary statistics for the first dataframe.
mtcars %>%
summarise(mean_mpg = mean(mpg))
gives the output:
mean_mpg
1 20.09062
PS. If you're learning dplyr, learning these five verbs will take you a long way: select(), filter(), group_by(), summarise(), arrange().
I can summarise a data frame with dplyr like this:
mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg))
To convert the output back to class data.frame, my current approach is this:
as.data.frame(mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)))
Is there any way to get dplyr to output a class data.frame without having to use as.data.frame?
As was pointed out in the comments you might not need to convert it since it might be good enough that it inherits from data frame. If that is not good enough then this still uses as.data.frame but is slightly more elegant:
mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
ungroup %>%
as.data.frame()
ADDED I just read in the comments that the reason you want this is to avoid the truncation of printed output. In that case just define this option, possibly in your .Rprofile file:
options(dplyr.print_max = Inf)
(Note that you can still hit the maximum defined by the "max.print" option associated with print so you would need to set that one too if it's also too low for you.)
Update: Changed %.% to %>% to reflect changes in dplyr.
In addition to what G. Grothendieck mentioned above, you can convert it into a new dataframe:
new_summary <- mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
as.data.frame()
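To check whether the conversion is needed at all, one can inspect the classes before and after. This is a small sketch with illustrative names (res, is_df, plain). The summary column is named mean_mpg here only for readability.

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# The summarised result is a tibble, which already inherits from data.frame.
is_df <- inherits(res, "data.frame")

# as.data.frame() drops the tibble classes and leaves a plain data frame.
plain <- as.data.frame(res)
class(plain)
```

If is_df being TRUE is enough for the downstream code, the as.data.frame() step can be skipped.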