What are the benefits of using with()? In the help file it mentions it evaluates the expression in an environment it creates from the data. What are the benefits of this? Is it faster to create an environment and evaluate it in there as opposed to just evaluating it in the global environment? Or is there something else I'm missing?
with is a wrapper for functions with no data argument
There are many functions that work on data frames and take a data argument so that you don't need to retype the name of the data frame every time you reference a column. lm, plot.formula, subset, and transform are just a few examples.
with is a general purpose wrapper to let you use any function as if it had a data argument.
Using the mtcars data set, we could fit a model with or without using the data argument:
# this is obviously annoying
mod = lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$wt)
# this is nicer
mod = lm(mpg ~ cyl + disp + wt, data = mtcars)
However, if (for some strange reason) we wanted to find the mean of cyl + disp + wt, there is a problem because mean doesn't have a data argument like lm does. This is the issue that with addresses:
# without with(), we would be stuck here:
z = mean(mtcars$cyl + mtcars$disp + mtcars$wt)
# using with(), we can clean this up:
z = with(mtcars, mean(cyl + disp + wt))
Wrapping foo() in with(data, foo(...)) lets us use any function foo as if it had a data argument - that is, we can use unquoted column names and avoid repetitive data_name$column_name or data_name[, "column_name"].
When to use with
Use with whenever you like interactively (R console) and in R scripts to save typing and make your code clearer. The more frequently you would need to re-type your data frame name for a single command (and the longer your data frame name is!), the greater the benefit of using with.
Also note that with isn't limited to data frames. From ?with:
For the default with method this may be an environment, a list, a data frame, or an integer as in sys.call.
I don't often work with environments, but when I do I find with very handy.
When you need pieces of a result for one line only
As @Rich Scriven suggests in the comments, with can be very useful when you need to use the results of something like rle. If you only need the results once, his example with(rle(data), lengths[values > 1]) lets you use the rle(data) results anonymously.
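For instance, a small runnable sketch of that pattern (the input vector is arbitrary):
x <- c(1, 1, 2, 3, 3, 3, 1)
rle(x)            # lengths: 2 1 3 1, values: 1 2 3 1
with(rle(x), lengths[values > 1])  # run lengths for runs whose value exceeds 1
# [1] 1 3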
When to avoid with
When there is a data argument
Many functions that have a data argument use it for more than just easier syntax when you call it. Most modeling functions (like lm), and many others too (ggplot!) do a lot with the provided data. If you use with instead of a data argument, you'll limit the features available to you. If there is a data argument, use the data argument, not with.
Adding to the environment
In my example above, the result was assigned to the global environment (z = with(...)). To make an assignment inside the list/environment/data, you can use within. (In the case of data.frames, transform is also good.)
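A minimal sketch (kpl, a kilometers-per-liter conversion, is just a made-up column name):
df2 <- within(mtcars, kpl <- 0.425 * mpg)    # assign a new column inside a copy of the data
df3 <- transform(mtcars, kpl = 0.425 * mpg)  # transform does the same for data frames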
In packages
Don't use with in R packages. There is a warning in help(subset) that could apply just about as well to with:
Warning This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
If you build an R package using with, checking it will probably produce warnings or notes about variables used without a visible binding. This will make the package unacceptable to CRAN.
Alternatives to with
Don't use attach
Many (mostly dated) R tutorials use attach to avoid re-typing data frame names by making columns directly accessible on the search path. attach is widely considered bad practice and should be avoided. One of the main dangers of attach is that data columns can become out of sync if they are modified individually; with avoids this pitfall because it is invoked one expression at a time. There are many, many questions on Stack Overflow where new users are following an old tutorial and run into problems because of attach. The easy solution is always: don't use attach.
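A quick illustration of the out-of-sync danger (safe to run, but remember the detach at the end):
attach(mtcars)
head(mpg)        # the copy of the column made when mtcars was attached
mtcars$mpg <- 0  # modify the data frame itself
head(mpg)        # still the old values - the attached copy is now out of sync
detach(mtcars)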
Using with all the time seems too repetitive
If you are doing many steps of data manipulation, you may find yourself beginning every line of code with with(my_data, .... You might think this repetition is almost as bad as not using with. Both the data.table and dplyr packages offer efficient data manipulation with non-repetitive syntax. I'd encourage you to learn to use one of them. Both have excellent documentation.
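For a flavor of the non-repetitive syntax, a minimal dplyr sketch (power_to_weight is a made-up derived column):
library(dplyr)
mtcars %>%
  filter(cyl > 4) %>%
  mutate(power_to_weight = hp / wt) %>%
  summarise(mean_ptw = mean(power_to_weight))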
I use it when I don't want to keep typing dataframe$. For example
with(mtcars, plot(wt, qsec))
rather than
plot(mtcars$wt, mtcars$qsec)
The former looks up wt and qsec in the mtcars data.frame. Of course
plot(qsec~wt, data = mtcars)
is more appropriate for plot or other functions that take a data= argument.
Related
In a regression or plotting context, the formula/tilde syntax is used to mean something like "The value of the LHS is dependent on the RHS". For example
lattice::xyplot(mpg ~ disp, data=mtcars)
displays disp as the x/independent variable and mpg as the y/dependent variable. However, with dplyr::case_when() this logic seems reversed: the LHS contains the condition, while the RHS contains what is returned if the condition is TRUE. Hence the RHS is dependent on the LHS. Why is this dependency relationship reversed for case_when() compared to other uses of formula syntax in R? Which is perhaps a different way of asking: why is my interpretation of what case_when() is doing wrong?
This is my theory. case_when extends if_else, and aims to be the if-then-else (case-when-else) construct that SQL and other languages have. Think of ~ as a shortcut for then.
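For example, a minimal sketch (the input values are arbitrary):
x <- c(-2, 0, 5)
dplyr::case_when(
  x < 0  ~ "negative",  # when x < 0 then "negative"
  x == 0 ~ "zero",
  TRUE   ~ "positive"   # else "positive"
)
# [1] "negative" "zero"     "positive"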
Looking at the source code
https://github.com/tidyverse/dplyr/blob/c2a3fca9ab85dd847605c7fda1d6b0b971f9381a/R/case-when.R
case_when uses the formula meaning of ~ as a way to split each argument into a left-hand side (the condition) and a right-hand side (the result); see the docs for ?rlang::f_lhs().
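You can see the split yourself with rlang's accessors (a small sketch):
f <- value < 0 ~ "negative"
rlang::f_lhs(f)  # value < 0   (the condition)
rlang::f_rhs(f)  # "negative"  (the result)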
Finally, this format is about readability: I've written several long data clean-up scripts with multi-line case_when calls, and they are readable thanks to this layout.
Say I have a dataframe data with 5 variables: var1, var2, var3, var4, var5. Imagine I want to do the following operation:
sqrt(data$var2)*(data$var4+data$var1)/data$var3 + data$var5/log(data$var3)
Is there a simpler way to just call the formula as follows?
sqrt(var2)*(var4+var1)/var3 + var5/log(var3)
Perhaps within another function? apply or the like? I can't work out how to do it properly. Removing variables from the dataframe is not a desirable option.
Just wrap using with
with(data, sqrt(var2)*(var4+var1)/var3 + var5/log(var3))
Or another option is transform
transform(data, new = sqrt(var2)*(var4+var1)/var3 + var5/log(var3))
Another option, which is undesirable, is attach, which puts the data frame on the search path so that its columns can be referenced directly as if they were objects. It is not recommended because it clutters the search path and can have side effects:
attach(data)
sqrt(var2)*(var4+var1)/var3 + var5/log(var3)
I'm running some Surv() functions, and one thing I do not like, or understand, is why this function does not take a "data=" argument. This is annoying because I want to perform the same Surv() function on the same data frame but filtered by different criteria each time.
So for example, my data frame is called "ikt" and I want to filter by "donor_type2=='LD'" and also use a strata variable "plan2". I tried the following but it didn't work:
library(survival)
library(dplyr)
ikt <- data.frame(organ_yrs = seq(1, 20),
                  organ_status = rep(c(0, 0, 1, 1), each = 5),
                  plan2 = rep(c('A', 'B', 'A', 'B'), each = 5),
                  donor_type2 = rep(c('LD', 'DD'), each = 10))
organ_surv_func <- function(data, criteria, strata) {
  data2 <- filter(data, criteria)
  Surv(data2$organ_yrs, data2$organ_status) ~ data2$strata
}

organ_surv_func(ikt, donor_type2 == 'LD', plan2)
Error in filter_impl(.data, quo) : object 'donor_type2' not found
I'm coming from a SAS background so that's probably why I'm thinking this should work and it doesn't...
I looked up something about sapply(), but I don't think that works when the function doesn't have the data= option.
Also, the reason I need the Surv() object and not just survfit(Surv()) (which would let me use data=) is that I'm also using survdiff() for log-rank tests, which takes the Surv() object as its main argument:
lr <- function(surv) {
  round(1 - pchisq(survdiff(surv)$chisq, length(survfit(surv)$strata) - 1), 3)
}
Thanks for any help you can provide.
I'm writing this "answer" to caution you against proceeding down the path you seem to be following. The Surv function is really intended to be used as the LHS of a formula defined within one of the survival package functions. You should avoid using constructions like:
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
For one thing it's needlessly verbose, but more importantly, it will prevent the use of predict when it comes time to match up names to formals. survdiff and the other survival functions all have both a data argument and a subset argument. The subset argument (or the base subset function) should allow you to avoid using filter.
organ_surv_func <- function(data, covar) {
  # substitute() splices the unevaluated covariate symbol into the formula
  form <- as.formula(substitute(Surv(organ_yrs, organ_status) ~ covar,
                                list(covar = covar)))
  survdiff(form, data = data)
}
# although I think running survdiff in a for-loop might be easier,
# as it would involve fewer tricky language constructs
organ_surv_func(subset(ikt, donor_type2 == 'LD'), covar = quote(plan2))
If you assign the output of organ_surv_func to a named variable, you will be able to more economically access chisq and strata:
myfit <- organ_surv_func(subset(ikt, donor_type2 == 'LD'), covar = quote(plan2))
my.lr.test <- function(myfit) {
  round(1 - pchisq(myfit$chisq, length(myfit$strata) - 1), 3)
}
my.lr.test(myfit) # not going to be useful with that dataset.
I understand how to use aes, but I don't understand the programmatic paradigm.
When I use ggplot, assuming I have a data.frame with column names "animal" and "weight", I can do the following.
ggplot(df, aes(x=weight)) + facet_grid(~animal) + geom_histogram()
What I don't understand is that weight and animal are not supposed to be strings, they are just typed out as is. How is it I can do that? It should be something like this instead:
ggplot(df, aes(x='weight')) + facet_grid('~animal') + geom_histogram()
I don't "declare" weight or animal as vectors anywhere? This seems to be... really unusual? Is this like a macro or something where it gets aes "whole," looks into df for its column names, and then fills in the gaps where it sees those variable names in aes?
I guess what I would like is to see some similar function in R which can take variables which are not declared in the scope, and the name of this feature, so I can read further and maybe implement my own similar functions.
In R this is called non-standard evaluation. There is a chapter on non-standard evaluation in the Advanced R book, available free online. Basically, R can look at the call stack to see the symbol that was passed to the function rather than just the value that symbol points to. It's used a lot in base R, and it's used in a slightly different way in the tidyverse, which has a formal class called a quosure to make this stuff easier to work with.
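As a toy sketch of the idea (my_with is a made-up name; this is roughly what base::with does for data frames):
my_with <- function(data, expr) {
  # capture the unevaluated expression, then evaluate it using the
  # data frame for column lookups and the caller's environment as a fallback
  eval(substitute(expr), envir = data, enclos = parent.frame())
}
my_with(mtcars, mean(mpg))  # mpg is found as a column of mtcars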
These methods are great for interactive programming; they save keystrokes and clutter. But if you write functions that depend too heavily on this behavior, they become difficult to script or to include in other functions.
The formula syntax (the one with the ~) is probably the safest and most programmatic way to work with symbols. It captures symbols that can later be evaluated in the context of a data.frame with functions like model.frame(). And there are built-in functions to help manipulate formulas, like update() and reformulate().
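For example:
f <- mpg ~ cyl + wt
head(model.frame(f, data = mtcars))           # evaluates the symbols in mtcars
update(f, . ~ . + disp)                       # mpg ~ cyl + wt + disp
reformulate(c("cyl", "wt"), response = "mpg") # builds a formula from strings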
And since you were explicitly interested in the aes() call: you can get the source code for any function in R just by typing its name without the quotes. With ggplot2_2.2.1, the function looks like this
aes
# function (x, y, ...)
# {
# aes <- structure(as.list(match.call()[-1]), class = "uneval")
# rename_aes(aes)
# }
# <environment: namespace:ggplot2>
The newest version of ggplot2 uses different rlang methods to be more consistent with the other tidyverse libraries, so it looks a bit different.
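A tiny sketch of that style (my_aes is a made-up stand-in, not ggplot2's actual implementation):
library(rlang)
my_aes <- function(x) enquo(x)  # capture the expression plus its environment
q <- my_aes(weight)
eval_tidy(q, data = data.frame(weight = 1:3))
# [1] 1 2 3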
Many intro R books and guides start off with the practice of attaching a data.frame so that you can call the variables by name. I have always found it favorable to call variables with $ notation or square bracket slicing [,2]. That way I can use multiple data.frames without confusing them and/or use iteration to successively call columns of interest. I noticed Google recently posted coding guidelines for R which included the line
1) attach: avoid using it
How do people feel about this practice?
I never use attach. with and within are your friends.
Example code:
> N <- 3
> df <- data.frame(x1=rnorm(N),x2=runif(N))
> df$y <- with(df,{
x1+x2
})
> df
x1 x2 y
1 -0.8943125 0.24298534 -0.6513271
2 -0.9384312 0.01460008 -0.9238312
3 -0.7159518 0.34618060 -0.3697712
>
> df <- within(df,{
x1.sq <- x1^2
x2.sq <- x2^2
y <- x1.sq+x2.sq
x1 <- x2 <- NULL
})
> df
y x2.sq x1.sq
1 0.8588367 0.0590418774 0.7997948
2 0.8808663 0.0002131623 0.8806532
3 0.6324280 0.1198410071 0.5125870
Edit: hadley mentions transform in the comments. Here is some code:
> transform(df, xtot=x1.sq+x2.sq, y=NULL)
x2.sq x1.sq xtot
1 0.41557079 0.021393571 0.43696436
2 0.57716487 0.266325959 0.84349083
3 0.04935442 0.004226069 0.05358049
I much prefer to use with to obtain the equivalent of attach on a single command:
with(someDataFrame, someFunction(...))
This also leads naturally to a form where subset is the first argument:
with(subset(someDataFrame, someVar > someValue),
someFunction(...))
which makes it pretty clear that we operate on a selection of the data. And while many modelling functions have both data and subset arguments, the usage above is more consistent as it also applies to functions that do not have data and subset arguments.
The main problem with attach is that it can result in unwanted behaviour. Suppose you have an object named xyz in your workspace, and you attach a dataframe abc which has a column named xyz. If your code refers to xyz, can you be sure whether it references the object or the dataframe column? If you don't use attach, it's easy: xyz refers to the object, and abc$xyz refers to the column of the dataframe.
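A quick demonstration of the ambiguity:
xyz <- "an object in the workspace"
abc <- data.frame(xyz = 1:3)
attach(abc)  # R warns that xyz is masked by .GlobalEnv
xyz          # the workspace object wins, not the column
detach(abc)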
One of the main reasons that attach is used frequently in textbooks is that it shortens the code.
"Attach" is an evil temptation. The only place where it works well is in the classroom setting where one is given a single dataframe and expected to write lines of code to do the analysis on that one dataframe. The user is unlikely to ever use that data again once the assignement is done and handed in.
However, in the real world, more data frames can be added to the collection of data in a particular project. Furthermore one often copies and pastes blocks of code to be used for something similar. Often one is borrowing from something one did a few months ago and cannot remember the nuances of what was being called from where. In these circumstances one gets drowned by the previous use of "attach."
Just like Leoni said, with and within are perfect substitutes for attach, but I wouldn't completely dismiss it. I use it sometimes, when I'm working directly at the R prompt and want to test some commands before writing them on a script. Especially when testing multiple commands, attach can be a more interesting, convenient and even harmless alternative to with and within, since after you run attach, the command prompt is clear for you to write inputs and see outputs.
Just make sure to detach your data after you're done!
I prefer not to use attach(), as it is far too easy to run a batch of code several times, each time calling attach(). The data frame is added to the search path each time, extending it unnecessarily. Of course, good programming practice is to also detach() at the end of the block of code, but that is often forgotten.
Instead, I use xxx$y or xxx[,"y"]. It's more transparent.
Another possibility is to use the data argument available in many functions which allows individual variables to be referenced within the data frame. e.g., lm(z ~ y, data=xxx).
While I, too, prefer not to use attach(), it does have its place when you need to persist an object (in this case, a data.frame) through the life of your program when you have several functions using it. Instead of passing the object into every R function that uses it, I think it is more convenient to keep it in one place and call its elements as needed.
That said, I would only use it if I know how much memory I have available and only if I make sure that I detach() this data.frame once it is out of scope.
Am I making sense?