How does R ggplot2 get the column names via aes? - r

I understand how to use aes, but I don't understand the programmatic paradigm.
When I use ggplot, assuming I have a data.frame with column names "animal" and "weight", I can do the following.
ggplot(df, aes(x=weight)) + facet_grid(~animal) + geom_histogram()
What I don't understand is that weight and animal are not supposed to be strings, they are just typed out as is. How is it I can do that? It should be something like this instead:
ggplot(df, aes(x='weight')) + facet_grid('~animal') + geom_histogram()
I don't "declare" weight or animal as vectors anywhere? This seems to be... really unusual? Is this like a macro or something where it gets aes "whole," looks into df for its column names, and then fills in the gaps where it sees those variable names in aes?
I guess what I would like is to see some similar function in R which can take variables which are not declared in the scope, and the name of this feature, so I can read further and maybe implement my own similar functions.

In R this is called non-standard evaluation. There is a chapter on non-standard evaluation in the Advanced R book, which is available free online. Basically, R can look at the call stack to see the symbol that was passed to a function rather than just the value that symbol points to. It's used a lot in base R, and it's used in a slightly different way in the tidyverse, which has a formal class called a quosure to make this easier to work with.
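As a rough base R illustration of the idea, substitute() returns the expression that was typed for an argument instead of its value, and eval() can then evaluate that expression with a data frame supplying the variables. The names show_symbol(), mean_of() and animals are made up for this sketch; it is not how ggplot2 itself is written.

```r
# Return the text of whatever expression was passed, without evaluating it.
show_symbol <- function(arg) {
  expr <- substitute(arg)  # the unevaluated expression behind `arg`
  deparse(expr)
}

# Evaluate the captured expression using the columns of `data`.
mean_of <- function(data, column) {
  expr <- substitute(column)
  # Columns are looked up in `data` first, then in the caller's environment.
  mean(eval(expr, data, parent.frame()))
}

animals <- data.frame(weight = c(10, 20, 30))

show_symbol(weight)           # works even though no variable `weight` exists here
mean_of(animals, weight)      # `weight` is resolved inside `animals`
mean_of(animals, weight * 2)  # whole expressions can be captured, not just names
```

Because the argument is never evaluated in the usual way, no variable called weight has to exist where the function is called.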
These methods are great for interactive programming. They save keystrokes and clutter, but if you write functions that depend too heavily on this feature, they become difficult to script or to call from other functions.
The formula syntax (the one with the ~) is probably the safest and most programmatic way to work with symbols. It captures symbols that can later be evaluated in the context of a data.frame with functions like model.frame(). There are also built-in functions to help manipulate formulas, such as update() and reformulate().
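A small sketch of those formula tools, using a made-up data frame called animals (the column names are only for illustration):

```r
animals <- data.frame(
  animal = c("cat", "dog", "cat"),
  weight = c(4, 20, 5),
  height = c(25, 60, 23)
)

# A formula stores symbols without evaluating them.
f <- weight ~ animal
vars <- all.vars(f)                    # the variable names used in the formula

# model.frame() evaluates those symbols in the context of the data frame.
mf <- model.frame(f, data = animals)

# update() edits an existing formula; "." stands for "what was already there".
f2 <- update(f, . ~ . + height)

# reformulate() builds a formula from character strings.
f3 <- reformulate(c("animal", "height"), response = "weight")
```

Since reformulate() starts from plain strings, it is a convenient bridge when the column names are only known at run time.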
And since you were explicitly interested in the aes() call, you can get the source code for any function in R just by typing its name without the parentheses. With ggplot2_2.2.1, the function looks like this:
aes
# function (x, y, ...)
# {
# aes <- structure(as.list(match.call()[-1]), class = "uneval")
# rename_aes(aes)
# }
# <environment: namespace:ggplot2>
The newest version of ggplot uses different rlang methods to be more consistent with other tidyverse libraries so it looks a bit different.
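To experiment with the same trick outside ggplot2, here is a minimal sketch modelled on the 2.2.1 source shown above. my_aes() and the animals data frame are hypothetical names, and the sketch leaves out the "uneval" class and the rename_aes() step.

```r
# Capture the arguments of the call as unevaluated expressions.
my_aes <- function(x, y, ...) {
  as.list(match.call()[-1])  # drop the function name, keep the arguments
}

mapping <- my_aes(x = weight, y = height * 2)

animals <- data.frame(weight = c(1, 2), height = c(3, 4))

# Later, evaluate each captured expression with the data frame's columns.
evaluated <- lapply(mapping, eval, envir = animals)
```

The mapping is just a named list of symbols and calls until something evaluates it against a data frame, which is why the column names never need to exist as variables.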

Related

Where is the Purrr ~ operator documented?

I searched for ??"~" but this only points me to rlang::env_bind (presumably, %<~%) and base::~. Within RStudio, how can I find Purrr's ~'s documentation? For example, if I forgot how to use ~ with two inputs, where do I look?
There is a good explanation in Advanced R (linked in another answer). There is also a short description with a usage example at the bottom left of the first page of the purrr cheatsheet.
How to use the twiddle (~) with multiple arguments is described in the documentation of the individual purrr functions. For example, the help page for map() describes the .f argument as follows:
.f
A function, formula, or vector (not necessarily atomic).
If a function, it is used as is.
If a formula, e.g. ~ .x + 2, it is converted to a function. There are three ways to refer to the arguments:
For a single argument function, use .
For a two argument function, use .x and .y
For more arguments, use ..1, ..2, ..3 etc
This syntax allows you to create very compact anonymous functions.
Moreover, as of version 4.1.0, R itself has also introduced a similar shorthand notation for functions:
R now provides a shorthand notation for creating functions, e.g. \(x) x + 1 is parsed as function(x) x + 1.
This shorthand may also prove useful in functions outside the tidyverse. The only difference from the twiddle is that the arguments have no default names, so you name them yourself. That explicit naming can in turn be useful when one anonymous function is nested inside another, a case where the twiddle notation does not work.
When you use ~ within purrr functions, it is passed to as_mapper(), which in turn passes it on to as_function() from rlang. The help files for these functions cover the basics of what you need. It is documented further in the Advanced R book, Chapter 9, Section 9.2.2, which has a few good examples; the rest of the chapter builds on those ideas.
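To get a feel for what such a conversion involves, here is a base R sketch. formula_to_function() is a hypothetical helper written for this illustration, not the code purrr or rlang actually use, and it only handles ., .x and .y (not ..1, ..2, and so on).

```r
# Turn a one-sided formula such as ~ .x + 2 into an ordinary function.
formula_to_function <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 2L)
  body_expr <- f[[2L]]        # the expression to the right of ~
  env <- environment(f)       # where the formula was created
  function(...) {
    args <- list(...)
    call_env <- new.env(parent = env)
    if (length(args) >= 1L) {
      assign(".", args[[1L]], envir = call_env)   # first argument as .
      assign(".x", args[[1L]], envir = call_env)  # ... and as .x
    }
    if (length(args) >= 2L) {
      assign(".y", args[[2L]], envir = call_env)  # second argument as .y
    }
    eval(body_expr, call_env)
  }
}

add_two <- formula_to_function(~ .x + 2)
add_two(1)

doubled <- vapply(1:3, formula_to_function(~ .x * 2), numeric(1))
sums <- Map(formula_to_function(~ .x + .y), 1:2, 3:4)
```

The formula only stores the expression; the names .x and .y get their meaning when the generated function binds them just before evaluating it.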

R statistics programming : using magrittr piping to pass 2 parameters to function

I am using magrittr, and I was able to pass one variable to an R function via magrittr pipes, and also to pick which parameter it is placed in for a multivariable function: F(x,y,z,...)
But I want to pass 2 parameters at the same time.
For example, I will use the select function from dplyr and pass in tableName and ColumnName:
I thought I could do it like this:
tableName %>% ColumnName %>% select(.,.)
But this did not work.
Hope someone can help me on this.
EDIT :
Some below are saying that this is a duplicate of a link provided by others.
But based on the algebraic structure of magrittr's definition of the pipe for multivariable functions, it should be "doable" from the algebraic definition of the pipe function alone.
The link provided by others goes beyond the base definition and employs other external functions and/or libraries to pass multiple parameters to the function.
I am looking for a solution, IF POSSIBLE, using just the magrittr library and other base operations.
So this is the restriction that is placed on this problem.
In most of my university courses in math and computer science we were restricted to use only those things taught in the course. So when I said I am using dplyr and magrittr, that should imply that those are the only things one is permitted to use, so it's under this constraint.
Hope this clarifies the scope of possible solutions here.
And if it's not possible to do this via just these libraries I want someone to tell me that it cannot be done.
I think you need a little more detail about exactly what you want, but as I understand the problem, I think one solution might be:
list(x = tableName, y = "ColumnName") %>% {select(eval(.$x),.$y) }
This is just a modification of the code linked in the chat. The issue with other implementations is that the first and second inputs to select() must be of specific (and different) types. So just plugging in two strings or two objects won't work.
In the same spirit, you can also use either:
list(x = "tableName", y = "ColumnName") %>% { select(get(.$x),.$y) }
or
list(tableName, "ColumnName") %>% do.call("select", .)
Note, however, that all of these functions (i.e., get(), eval(), and do.call()) have an environment specification in them and could result in errors if improperly specified. They work just fine in these examples because everything is happening in the global environment, but that might change if they were, e.g., called in a function.
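A small base R sketch of that environment caveat. The names table_name, lookup_default() and lookup_explicit() are made up for the illustration, and get() stands in for the other two functions.

```r
table_name <- data.frame(a = 1:3)
outer_env <- environment()   # the environment where table_name was defined

lookup_default <- function(name) {
  table_name <- "a local value that shadows the data frame"
  get(name)                  # starts searching in the function's own environment
}

lookup_explicit <- function(name, envir) {
  table_name <- "a local value that shadows the data frame"
  get(name, envir = envir)   # searches the environment that was asked for
}

shadowed <- lookup_default("table_name")              # the local string
found <- lookup_explicit("table_name", outer_env)     # the data frame
```

At the top level both calls would see the same object; the difference only shows up once the lookup happens inside a function.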

How to have functions chaining in R like in c# with linq we have method chaining?

I am a newbie to R. One thing I noticed is that we need to keep saving the result to a variable each time before further processing. Is there some way I can store the result in some buffer and use that buffer in later processing?
For people who are familiar with C#, LINQ has a feature called method chaining: we keep passing the intermediate result to various functions on the fly without storing it in separate variables, and in the end we get the required output. This saves lots of extra syntax, so is there something like this in R?
Function composition is to functional programming as method chaining is to object-oriented programming.
x <- foo(bar(baz(y)))
is basically the same as
x = baz(y).bar().foo()
in the languages you might be familiar with.
If you're uncomfortable with nested parens and writing things backwards, the magrittr package provides the %>% operator to unpack expressions:
library(magrittr)
x = y %>% baz() %>% bar() %>% foo()
R also provides a couple of frameworks for conventional OO programming: reference classes and R6. With those, you can write something like
x = y$baz()$bar()$foo()
but I'd suggest learning how to deal with "normal" R expressions first.
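To see that chaining is only a different way of writing nested calls, here is a toy sketch in base R. The operator %then% is a hypothetical stand-in defined for this illustration; it is far simpler than magrittr's %>% (no placeholder, no extra arguments).

```r
# A toy pipe: pass the left-hand value to the function on the right.
`%then%` <- function(value, f) f(value)

y <- c(-4, 3, -9)

# Nested calls, read inside-out.
nested <- sqrt(sum(abs(y)))

# The same computation, read left to right.
chained <- y %then% abs %then% sum %then% sqrt
```

Custom infix operators are evaluated from left to right, so each step receives the result of the previous one.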
In R we have something called pipes (%>%), through which you can send the output of one function to another, i.e. the output of one function becomes the input of the next function in the chain.
Try something like this in the R console. Consider a tibble MyData containing username and pwd as two columns; you can use pipes as follows:
library(dplyr)
MyData %>%
  select(username, pwd) %>%
  filter(!is.na(username)) %>%
  arrange(username)
This will print the username and pwd columns, sorted by username, for the rows where username is not NA.
Hope that helps

When to use 'with' function and why is it good?

What are the benefits of using with()? In the help file it mentions it evaluates the expression in an environment it creates from the data. What are the benefits of this? Is it faster to create an environment and evaluate it in there as opposed to just evaluating it in the global environment? Or is there something else I'm missing?
with is a wrapper for functions with no data argument
There are many functions that work on data frames and take a data argument so that you don't need to retype the name of the data frame for every time you reference a column. lm, plot.formula, subset, transform are just a few examples.
with is a general purpose wrapper to let you use any function as if it had a data argument.
Using the mtcars data set, we could fit a model with or without using the data argument:
# this is obviously annoying
mod = lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$wt)
# this is nicer
mod = lm(mpg ~ cyl + disp + wt, data = mtcars)
However, if (for some strange reason) we wanted to find the mean of cyl + disp + wt, there is a problem because mean doesn't have a data argument like lm does. This is the issue that with addresses:
# without with(), we would be stuck here:
z = mean(mtcars$cyl + mtcars$disp + mtcars$wt)
# using with(), we can clean this up:
z = with(mtcars, mean(cyl + disp + wt))
Wrapping foo() in with(data, foo(...)) lets us use any function foo as if it had a data argument - which is to say we can use unquoted column names, preventing repetitive data_name$column_name or data_name[, "column_name"].
When to use with
Use with whenever you like interactively (R console) and in R scripts to save typing and make your code clearer. The more frequently you would need to re-type your data frame name for a single command (and the longer your data frame name is!), the greater the benefit of using with.
Also note that with isn't limited to data frames. From ?with:
For the default with method this may be an environment, a list, a data frame, or an integer as in sys.call.
I don't often work with environments, but when I do I find with very handy.
When you need pieces of a result for one line only
As @Rich Scriven suggests in the comments, with can be very useful when you need to use the results of something like rle. If you only need the results once, then his example with(rle(data), lengths[values > 1]) lets you use the rle(data) results anonymously.
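A concrete version of that pattern, with made-up inputs (runs and settings are illustrative names), also showing with on a plain list:

```r
runs <- c(1, 1, 2, 2, 2, 3)

# rle() returns a list with components `lengths` and `values`;
# with() lets us refer to both without storing the result first.
long_values <- with(rle(runs), lengths[values > 1])

# with() works on an ordinary list as well.
settings <- list(rate = 0.5, n = 4)
total <- with(settings, rate * n)
```

Here long_values holds the run lengths of the values greater than 1, and the rle() result itself is never assigned to a name.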
When to avoid with
When there is a data argument
Many functions that have a data argument use it for more than just easier syntax when you call it. Most modeling functions (like lm), and many others too (ggplot!) do a lot with the provided data. If you use with instead of a data argument, you'll limit the features available to you. If there is a data argument, use the data argument, not with.
Adding to the environment
In my example above, the result was assigned to the global environment (z = with(...)). To make an assignment inside the list/environment/data, you can use within. (In the case of data.frames, transform is also good.)
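A short sketch of the difference, using a small slice of mtcars (the names cars_small and ratio are only for illustration):

```r
cars_small <- head(mtcars[, c("mpg", "wt")], 3)

# with(): the result is a plain vector; cars_small itself is unchanged.
ratio_vector <- with(cars_small, mpg / wt)

# within(): the assignment happens inside a copy of the data, which is returned.
cars_within <- within(cars_small, ratio <- mpg / wt)

# transform(): the same idea for data frames, using name = value pairs.
cars_transform <- transform(cars_small, ratio = mpg / wt)
```

Both within() and transform() return a modified copy, so the result still has to be assigned if you want to keep it.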
In packages
Don't use with in R packages. There is a warning in help(subset) that could apply just about as well to with:
Warning This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
If you build an R package using with, when you check it you will probably get warnings or notes about using variables without a visible binding. This will make the package unacceptable to CRAN.
Alternatives to with
Don't use attach
Many (mostly dated) R tutorials use attach to avoid re-typing data frame names by making columns accessible to the global environment. attach is widely considered to be bad practice and should be avoided. One of the main dangers of attach is that data columns can become out of sync if they are modified individually. with avoids this pitfall because it is invoked one expression at a time. There are many, many questions on Stack Overflow where new users are following an old tutorial and run into problems because of attach. The easy solution is always: don't use attach.
Using with all the time seems too repetitive
If you are doing many steps of data manipulation, you may find yourself beginning every line of code with with(my_data, .... You might think this repetition is almost as bad as not using with. Both the data.table and dplyr packages offer efficient data manipulation with non-repetitive syntax. I'd encourage you to learn to use one of them. Both have excellent documentation.
I use it when I don't want to keep typing dataframe$. For example
with(mtcars, plot(wt, qsec))
rather than
plot(mtcars$wt, mtcars$qsec)
The former looks up wt and qsec in the mtcars data.frame. Of course
plot(qsec~wt, data = mtcars)
is more appropriate for plot or other functions that take a data= argument.

What do . (dot) and % (percentage) mean in R?

My question might sound stupid, but I have noticed that . and % are often used in R and, to be frank, I don't really know why.
I have seen them in dplyr (go here for an example) and data.table (e.g. .SD), but I am sure they must be used in other places as well.
Therefore, my question is:
What does . mean? Is it some kind of R coding best-practice nomenclature (e.g. _functionName is often used in JavaScript to indicate a private function)? If yes, what's the rule?
Same question for %, which is also often used in R (e.g. %in%, %>%, ...).
My guess has always been that . and % are a convenient way to quickly call a function, but the way data.table uses . does not follow this logic, which confuses me.
. has no inherent/magical meaning in R. It's just another character that you can use in symbol names. But because it is so convenient to type, it has been given special meaning by certain functions and conventions in R. Here are just a few:
. is used to look up S3 generic method implementations. For example, if you call a generic function like plot with an object of class lm as the first parameter, then it will look for a function named plot.lm and, if found, call that.
Often . in formulas means "all other variables"; for example, lm(y~., data=dd) will regress y on all the other variables in the data.frame dd.
Libraries like dplyr use it as a special variable name to indicate the current data.frame for methods like do(). They could just as easily have chosen to use the variable name X instead.
Functions like bquote use .() as a special function to escape variables in expressions.
Variables that start with a period are considered "hidden" and will not show up with ls() unless you call ls(all.names=TRUE) (similar to the UNIX file system behavior).
However, you can also just define a variable named my.awesome.variable<-42 and it will work just like any other variable.
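Several of these conventions can be tried out in a few lines of base R. The generic describe() and the environment e are hypothetical names made up for this sketch.

```r
# S3 dispatch: the generic looks for describe.<class> based on the class of x.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something else"
describe.lm <- function(x, ...) "a linear model"

# "." in a formula: regress mpg on all other columns of the data.
fit <- lm(mpg ~ ., data = mtcars[, c("mpg", "wt", "hp")])
dispatched <- describe(fit)
coef_names <- names(coef(fit))

# .() inside bquote(): substitute the value of n into the expression.
n <- 3
expr_text <- deparse(bquote(x + .(n)))

# Names starting with a period are hidden from ls() by default.
e <- new.env()
assign(".hidden", 1, envir = e)
assign("visible", 2, envir = e)
shown <- ls(e)
shown_all <- ls(e, all.names = TRUE)
```

In each case the dot is an ordinary character; the special behaviour comes from the function that interprets it (UseMethod, the formula machinery, bquote, ls).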
A % by itself doesn't mean anything special, but R allows you to define your own infix operators in the form %<something>% using two percent signs. If you define
`%myfun%` <- function(a,b) {
a*3-b*2
}
you can call it like
5 %myfun% 2
# [1] 11
MrFlick's answer doesn't cover the usage of . in data.table;
In data.table, . is (essentially) an alias for list, so any* call to [.data.table that accepts a list can also be passed an object wrapped in .().
So the following are equivalent:
DT[ , .(x, y)]
DT[ , list(x, y)]
*well, not quite. any use in the j argument, yes; elsewhere is a work in progress, see here.
