What's wrong with using $-extraction?

fortune(312) and fortune(343) allude to the problems with using $ to extract elements of a list instead of [[, but aren't specific about what exactly the dangers are.
The problem here is that the $ notation is a magical shortcut and like any other magic
if used incorrectly is likely to do the programmatic equivalent of turning yourself into
a toad.
-- Greg Snow (in response to a user who wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the '[[' and '[' habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)
Looking through the documentation for `$`, I've found that
$ is only valid for recursive objects
and
The main difference is that $ does not allow computed indices, whereas [[ does [...] Also, the partial matching behavior of [[ can be controlled using the exact argument.
So, is the argument to use [[ over $ because the former offers greater control and transparency in writing code? What are the actual risks of using $, and if [[ is preferred are there any circumstances where it is appropriate to use $-extraction?

Consider the list:
foo<-list(BlackCat=1,BlackDog=2, WhiteCat=3,WhiteDog=4)
Suppose you want to compute the index from two user-supplied variables: the colour and the animal species.
Parametrising the colour and the species somewhere in the code as
myColour <- "Black"
mySpecies <- "Dog"
you can build the index programmatically as
foo[[paste0(myColour, mySpecies)]]
by using [[ (or [). However, this is not the case for $-extraction: $ treats what follows it as a literal element name, so foo$paste0(myColour, mySpecies) does not evaluate paste0 at all; it tries to call the (non-existent) element foo$paste0.
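For example, a minimal sketch of the contrast, continuing the example above:
foo[[paste0(myColour, mySpecies)]]  # 2: the name is computed, then looked up
foo$BlackDog                        # 2: $ requires the literal name
foo$"BlackDog"                      # 2: a quoted literal also works
# foo$paste0(myColour, mySpecies)   # error: tries to call foo$paste0, which is NULL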

Related

Selecting non-existent columns in data.table

Here is my small example:
library(data.table)
data<-data.table(x_2dig_id=rnorm(100))
data$x_2dig
What I do not understand is why I do not get an error (there is no column x_2dig in my data).
It would be great if somebody could elaborate on this.
This happens with lists as well, one of the most basic data structures in R. $ is a shorthand operator: x$y is equivalent to x[["y", exact = FALSE]]. It's often used to access variables in a data frame.
If you want to receive a warning for partial matching, you can set:
options(warnPartialMatchDollar = TRUE)
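For example, a minimal sketch with a plain list:
lst <- list(value = 42)
lst$val                                # 42: silently partial-matches "value"
lst[["val"]]                           # NULL: [[ matches exactly by default
options(warnPartialMatchDollar = TRUE)
lst$val                                # 42, now with a partial-match warning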
From the R documentation:
?`[[`
x[[i, exact = TRUE]]
exact Controls possible partial matching of [[ when extracting by a
character vector (for most objects, but see under ‘Environments’). The
default is no partial matching. Value NA allows partial matching but
issues a warning when it occurs. Value FALSE allows partial matching
without any warning.
and
Both [[ and $ select a single element of the list. The main difference
is that $ does not allow computed indices, whereas [[ does. x$name is
equivalent to x[["name", exact = FALSE]]. Also, the partial matching
behavior of [[ can be controlled using the exact argument.
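A small sketch of the exact argument in action:
lst <- list(value = 42)
lst[["val"]]                 # NULL: the default is exact matching
lst[["val", exact = FALSE]]  # 42: partial matching without a warning
lst[["val", exact = NA]]     # 42: partial matching, with a warning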
This is also explained in the subsetting chapter of Hadley Wickham's Advanced R book.
This goes back to the old days of R being overly helpful and trying to guess what you meant. Data frames (upon which data tables are built) have this functionality, while tibbles (data_frame) intentionally leave this out to prevent the problem you're seeing! Hadley talks about this sometimes in his talks.
Because data.table sees that the prefix uniquely identifies one column. The example below demonstrates that once it is no longer unique, the lookup fails and returns NULL.
data<-data.table(x_2dig_id=rnorm(100),x_2dig_id2=rnorm(100))
data$x_2dig
NULL
Edit: Per Joy, this may be an R thing rather than a data.table thing per se.

Trying to understand R structure: what does a dot in function names signify?

I am trying to learn how to use R. I can use it to do basic things like reading in data and running a t-test. However, I am struggling to understand the way R is structured (I have a very mediocre Java background).
What I don't understand is the way the functions are classified.
For example, in is.na(someVector), is `is` a class? Or for read.csv, is csv a method of a read class?
I need an easier way to learn the functions than simply memorizing them randomly. I like the idea of things belonging to other things. To me it seems like this gives a language a tree structure which makes learning more efficient.
Thank you
Sorry if this is an obvious question; I am genuinely confused and have been reading/watching quite a few tutorials.
Your confusion is entirely understandable, since R mixes two conventions of using (1) . as a general-purpose word separator (as in is.na(), which.min(), update.formula(), data.frame() ...) and (2) . as an indicator of an S3 method, method.class (i.e. foo.bar() would be the "foo" method for objects with class attribute "bar"). This makes functions like summary.data.frame() (i.e., the summary method for objects with class data.frame) especially confusing.
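To illustrate convention (2), here is a minimal sketch with made-up names (the generic speak and the class "cat" are hypothetical):
speak <- function(x) UseMethod("speak")    # S3 generic
speak.cat <- function(x) "meow"            # the "speak" method for class "cat"
speak.default <- function(x) "<silence>"   # fallback for everything else
pet <- structure(list(), class = "cat")
speak(pet)   # dispatches to speak.cat: "meow"
speak(1:3)   # no integer method, so speak.default: "<silence>"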
As @thelatemail points out above, there are some other sets of functions that share a common prefix across a variety of different options (as in read.table(), read.delim(), read.fwf() ...), but these are entirely conventional, not specified anywhere in the formal language definition.
dotfuns <- apropos("[a-z]\\.[a-z]")            # all visible names containing a dot
dotstart <- gsub("\\.[a-zA-Z]+", "", dotfuns)  # strip the dot-suffixes, keeping the prefix
head(dotstart)
tt <- table(dotstart)                          # count names sharing each prefix
head(rev(sort(tt)), 10)                        # the ten most common prefixes
## as is print Sys file summary dev format all sys
## 118 51 32 18 17 16 16 15 14 13
(Some of these are actually S3 generics, some are not. For example, Sys.*(), dev.*(), and file.*() are not.)
Historically _ was used as a shortcut for the assignment operator <- (before = was available as a synonym), so it wasn't available as a word separator. I don't know offhand why camelCase wasn't adopted instead.
Confusingly, methods("is") returns is.na() among many others, but it is effectively just searching for functions whose names start with "is."; it warns that "function 'is' appears not to be generic"
Rasmus Bååth's presentation on naming conventions is informative and entertaining (if a little bit depressing).
extra credit: are there any dot-separated S3 method names, i.e. cases where a function name of the form x.y.z represents the x.y method for objects with class attribute z ?
answer (from Hadley Wickham in comments): as.data.frame.data.frame() wins. as.data.frame is an S3 generic (unlike, say, as.numeric), and as.data.frame.data.frame is its method for data.frame objects. Its purpose (from ?as.data.frame):
If a data frame is supplied, all classes preceding ‘"data.frame"’
are stripped, and the row names are changed if that argument is
supplied.
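A quick way to see that method dispatch in action:
df <- data.frame(a = 1:2)
class(df)                          # "data.frame"
identical(as.data.frame(df), df)   # TRUE: dispatches to as.data.frame.data.frame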

What do . (dot) and % (percentage) mean in R?

My question might sound stupid, but I have noticed that . and % are often used in R and, to be frank, I don't really know why.
I have seen it in dplyr and in data.table (e.g. .SD), but I am sure it must be used in other places as well.
Therefore, my question is:
What does . mean? Is it some kind of R coding best-practice nomenclature? (For example, _functionName is often used in JavaScript to indicate a private function.) If yes, what's the rule?
Same question for %, which is also often used in R (e.g. %in%, %>%, ...).
My guess has always been that . and % are convenient ways to quickly call functions, but the way data.table uses . does not follow this logic, which confuses me.
. has no inherent/magical meaning in R. It's just another character that you can use in symbol names. But because it is so convenient to type, it has been given special meaning by certain functions and conventions in R. Here are just a few:
. is used to look up S3 generic method implementations. For example, if you call a generic function like plot with an object of class lm as the first argument, it will look for a function named plot.lm and, if found, call that.
often . in formulas means "all other variables", for example lm(y~., data=dd) will regress y on all the other variables in the data.frame dd.
libraries like dplyr use it as a special variable name to indicate the current data.frame in functions like do(). They could just as easily have chosen the variable name X instead.
functions like bquote use .() as a special function to escape variables in expressions
variables that start with a period are considered "hidden" and will not show up with ls() unless you call ls(all.names=TRUE) (similar to the UNIX file system behavior)
However, you can also just define a variable with my.awesome.variable <- 42 and it will work like any other variable.
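A quick sketch of a few of the points above (variable names are made up):
dd <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10))
fit <- lm(y ~ ., data = dd)   # "." in a formula: all other variables, i.e. y ~ a + b
.hidden <- 1
ls()                          # ".hidden" is not listed
ls(all.names = TRUE)          # now it is
my.awesome.variable <- 42     # "." as an ordinary word separator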
A % by itself doesn't mean anything special, but R allows you to define your own infix operators in the form %<something>% using two percent signs. If you define
`%myfun%` <- function(a, b) {
  a*3 - b*2
}
you can call it like
5 %myfun% 2
# [1] 11
MrFlick's answer doesn't cover the usage of . in data.table:
In data.table, . is (essentially) an alias for list, so any* call to [.data.table that accepts a list can also be passed an object wrapped in .().
So the following are equivalent:
DT[ , .(x, y)]
DT[ , list(x, y)]
*Well, not quite: any use in the j argument, yes; elsewhere it is still a work in progress.
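A quick check that the two forms agree:
library(data.table)
DT <- data.table(x = 1:3, y = letters[1:3])
identical(DT[, .(x, y)], DT[, list(x, y)])   # TRUE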

lapply-ing with the "$" function

I was going through some examples in Hadley's guide to functionals and came across an unexpected problem.
Suppose I have a list of model objects,
x <- 1:3; y <- 3:1; bah <- list(lm(x~y), lm(y~x))
and want to extract something from each (as suggested in Hadley's question about a list called "trials"). I was expecting one of these to work:
lapply(bah,`$`,i='call') # or...
lapply(bah,`$`,call)
However, these return NULLs. It seems like I'm not misusing the $ function, as these things work:
`$`(bah[[1]],i='call')
`$`(bah[[1]],call)
Anyway, I'm just doing this as an exercise and am curious where my mistake is. I know I could use an anonymous function, but think there must be a way to use syntax similar to my initial non-solution. I've looked through the places $ is mentioned in ?Extract, but didn't see any obvious explanation.
I just realized that this works:
lapply(bah,`[[`,i='call')
and this
lapply(bah,function(x)`$`(x,call))
Maybe this just comes down to some lapply voodoo that demands anonymous functions where none should be needed? I feel like I've heard that somewhere on SO before.
This is documented in ?lapply, in the "Note" section (emphasis mine):
For historical reasons, the calls created by lapply are unevaluated,
and code has been written (e.g. bquote) that relies on this. This
means that the recorded call is always of the form FUN(X[[0L]],
...), with 0L replaced by the current integer index. This is not
normally a problem, but it can be if FUN uses sys.call or
match.call or if it is a primitive function that makes use of the
call. This means that it is often safer to call primitive functions
with a wrapper, so that e.g. lapply(ll, function(x) is.numeric(x))
is required in R 2.7.1 to ensure that method dispatch for is.numeric
occurs correctly.
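In other words, `$` is one of those primitives that makes use of the call, so the name passed through lapply's ... never reaches it as a literal symbol. The safe alternatives, using bah from above:
lapply(bah, function(m) m$call)   # wrapper, as the Note recommends
lapply(bah, `[[`, "call")         # works: [[ evaluates its index argument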

Why is using assign bad?

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
  assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X = values.foo, FUN = function(k) paste("This is: ", k))
names(foo) <- names.foo
This is also the reason the R FAQ (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know about? I suspect it may mess with scoping or lazy evaluation, but I am not sure. Example code that demonstrates such problems would be great.
Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one; the single list object will be much easier to use in analyses. You have already demonstrated the major downside of assign: you must then always use get to corral or gather up the similarly named objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) suffices for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier; it does not imply assign is "bad". I happen to read it as "less functional", since you are not actually returning a value that gets assigned. The effect is a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign tends to be adopted by people coming from languages with global variables, as a way of avoiding the "True R Way", i.e. functional programming with data objects. They really should be learning to use lists rather than littering their workspace with individually named items.
There is another assignment paradigm that can be used:
foo <- setNames(paste0(letters, 1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
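For example:
foo <- setNames(paste0(letters, 1:26), LETTERS)
foo["A"]     # the named element "a1" (name "A" attached)
foo[["A"]]   # the bare value "a1"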
As the source of fortune(236) I thought I would add a couple of examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google it). This is often the source of hard-to-find bugs.
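The quiz has no safe answer: if the function assigns into the global environment, it can change x behind your back. A minimal sketch (the function body is made up):
some.function.that.uses.assign <- function(z) {
  assign("x", 99, envir = globalenv())   # side effect: clobbers the global x
  mean(z)
}
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
x   # 99, not 1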
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected which leads to either complaining about R (which is as fair as complaining that my next door neighbor's house is too far away because I chose to walk the long way around the block) or even worse, delve into using eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
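What actually happens is that assign creates a new object whose name contains the brackets; the vector itself is untouched. A small sketch:
myvector <- numeric(3)
i <- 3
curname <- paste('myvector[', i, ']')   # "myvector[ 3 ]"
assign(curname, i)
myvector                # still c(0, 0, 0)
get("myvector[ 3 ]")    # 3: a separate object with brackets in its name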
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: for environments, the $ and $<- operators also work, so in many cases the explicit assign and get aren't necessary there either.
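For example, a minimal sketch:
e <- new.env()
assign("x", 1, envir = e)   # the explicit form
e$y <- 2                    # the same idea via $<-
get("x", envir = e)         # 1
e$y                         # 2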

Resources